Abstract

In this paper, we study the problem of estimating uniformly well the mean values of several distributions 
given a finite budget of samples. If the variance of the distributions were known, one could design an optimal 
sampling strategy by collecting a number of independent samples per distribution that is proportional to 
their variance. However, in the more realistic case where the distributions are not known in advance, one 
needs to design adaptive sampling strategies in order to select which distribution to sample from according 
to the previously observed samples. We describe two strategies based on pulling the distributions a number 
of times that is proportional to a high-probability upper-confidence-bound on their variance (built from 
previous observed samples) and report a finite-sample performance analysis on the excess estimation error 
compared to the optimal allocation. We show that the performance of these allocation strategies depends 
not only on the variances but also on the full shape of the distributions. 


Keywords: Bandit Theory, Active Learning 


1. Introduction 

Consider a marketing problem where the objective is to estimate the potential impact of several new 
products or services. A common approach to this problem is to design active online polling systems, where 
at each time a product is presented (e.g., via a web banner on the Internet) to random customers from a
population of interest, and feedback is collected (e.g., whether the customer clicks on the ad or not) and
used to estimate the average preference of all the products. It is often the case that some products have a
general consensus of opinion (low variance) while others have a large variability (high variance). While in 
the former case very few votes would be enough to have an accurate estimate of the value of the product, in 
the latter the system should present the product to more customers in order to achieve the same accuracy. 
Since the variability of the opinions for different products is not known in advance, the objective is to design 
an active strategy that selects which product to display at each time step in order to estimate the values of 
all the products uniformly well. 

The problem of online polling can be seen as an online allocation problem with several options, where the 
accuracy of the estimation of the quality of each option depends on the quantity of the resources allocated 
to it and also on some (initially unknown) intrinsic variability of the option. This general problem is closely 
related to the problems of active learning [^, Q, sampling and Monte-Carlo methods [1^ . and optimal 
experimental design [na. A particular instance of this problem is introduced in [1] as an active learning
problem in the framework of stochastic multi-armed bandits. More precisely, the problem is modeled as a
repeated game between a learner and a stochastic environment, defined by a set of K unknown distributions
$\{\nu_k\}_{k=1}^K$, where at each round t, the learner selects an action (or arm) $k_t$ and as a consequence receives a
random sample from $\nu_{k_t}$ (independent of the past samples). Given a total budget of n samples, the goal is to
define an allocation strategy over arms so as to estimate their expected values uniformly well. Note that if
the variances $\{\sigma_k^2\}_{k=1}^K$ of the arms were initially known, the optimal allocation strategy would be to sample
the arms proportionally to their variances, or more accurately, proportionally to $\lambda_k = \sigma_k^2 / \sum_{j=1}^K \sigma_j^2$. However,
since the distributions are initially unknown, the learner should follow an active allocation strategy which
adapts its behavior as samples are collected. The performance of this strategy is measured by its regret
(defined precisely by Equation 4), that is, the difference between the maximal expected quadratic estimation
error of the algorithm and the maximal expected error of the optimal allocation.

Antos et al. [1] presented an algorithm, called GAFS-MAX, that allocates samples proportionally to
the empirical variances of the arms, while imposing that each arm should be pulled at least $\sqrt{n}$ times (to
guarantee good estimation of the true variances), where n is the total budget of pulls. They proved that for
large enough n, the regret of their algorithm scales with $\tilde{O}(n^{-3/2})$ and conjectured that this rate is optimal.^1

However, the performance displays both an implicit (in the condition for large enough n) and explicit (in the 
regret bound) dependency on the inverse of the smallest optimal allocation proportion, i.e., $\lambda_{\min} = \min_k \lambda_k$.
This suggests that the algorithm is expected to have a poor performance whenever an arm has a very small 
variance compared to the others. Whether this dependency is due to the analysis of GAFS-MAX, to the 
specific class of algorithms, or to an intrinsic characteristic of the problem is an interesting open question. 
One of the main objectives of this paper is to investigate this issue and identify under which conditions this 
dependency can be avoided. Our main contributions and findings are as follows: 

• We introduce two new algorithms based on upper-confidence-bounds (UCB) on the variance. 

• The first algorithm, called CH-AS, is based on Chernoff-Hoeffding's bound. Its regret has the rate
$\tilde{O}(n^{-3/2})$ and an inverse dependency on $\lambda_{\min}$, similar to GAFS-MAX. The main differences are: the
bound for CH-AS holds for any n (and not only for large enough n), multiplicative constants are made
explicit, and finally, the proof is simpler and relies on very simple tools.

• The second algorithm, called B-AS, uses a sharper inequality than CH-AS, and has a better performance
(in terms of the number of pulls) in targeting the optimal allocation strategy without any
dependency on $\lambda_{\min}$. However, moving from the number of pulls to the regret causes the inverse
dependency on $\lambda_{\min}$ to appear in the bound again. We show that this might be due to the specific shape
of the distributions $\{\nu_k\}_{k=1}^K$ and derive a regret bound independent of $\lambda_{\min}$ for the case of Gaussian
arms.

• We show empirically that while the performance of CH-AS depends on $\lambda_{\min}$ in the case of Gaussian
arms, this dependence does not exist for B-AS and GAFS-MAX, as they perform well in this case.
This suggests that 1) it is not possible to remove $\lambda_{\min}$ from the regret bound of CH-AS, independent
of the arms' distributions, and 2) GAFS-MAX's analysis could be improved along the same line as the
proof of B-AS for the Gaussian arms. We also report experiments providing insights on the (somewhat
unexpected) fact that the full shapes of the distributions, and not only their variances, impact the
regret of these algorithms.


2. Preliminaries 

The allocation problem studied in this paper is formalized as the standard K-armed stochastic bandit
setting, where each arm k = 1, ..., K is characterized by a distribution $\nu_k$ with mean $\mu_k$ and non-zero
variance $\sigma_k^2 > 0$. At each round $t \ge 1$, the learner (algorithm A) selects an arm $k_t$ and receives a sample
drawn from $\nu_{k_t}$ independently of the past. The objective is to estimate the mean values of all the arms
uniformly well given a total budget of n pulls. An adaptive algorithm defines its allocation strategy as a
function of the samples observed in the past (i.e., at time t, the selected arm $k_t$ is a function of all the
observations up to time t - 1). After n rounds and observing $T_{k,n} = \sum_{t=1}^{n} \mathbb{1}\{k_t = k\}$ samples from each
arm k, the algorithm A returns the empirical estimates $\hat\mu_{k,n} = \frac{1}{T_{k,n}} \sum_{t=1}^{T_{k,n}} X_{k,t}$, where $X_{k,t}$ denotes the sample
received when we pull arm k for the t-th time. The accuracy of the estimation of each arm k is measured
according to its expected squared estimation error, or loss

$$L_{k,n} = E\left[(\hat\mu_{k,n} - \mu_k)^2\right]. \tag{1}$$

The global performance or loss of A is defined as the worst loss of the arms

$$L_n(A) = \max_{1 \le k \le K} L_{k,n}. \tag{2}$$


^1 The notation $u_n = \tilde{O}(v_n)$ means that there exist C > 0 and $\alpha > 0$ such that $u_n \le C (\log n)^{\alpha} v_n$ for sufficiently large n.



If the variances of the arms were known in advance, one could design an optimal static allocation (i.e.,
the number of pulls does not depend on the observed samples) by pulling the arms proportionally to their
variances. In the case of static allocation, if an arm k is pulled a fixed number of times $T_{k,n}$, its loss is
computed as^2

$$L_{k,n} = \frac{\sigma_k^2}{T_{k,n}}. \tag{3}$$

By choosing $T_{k,n}$ so as to minimize $L_n$ under the constraint that $\sum_{k=1}^K T_{k,n} = n$, the optimal static allocation
strategy $A^*$ pulls each arm k (up to rounding effects) $T^*_{k,n} = \frac{\sigma_k^2}{\sum_{i=1}^K \sigma_i^2}\, n$ times, and achieves a global performance
$L_n(A^*) = \Sigma/n$, where $\Sigma = \sum_{i=1}^K \sigma_i^2$. We denote by $\lambda_k = T^*_{k,n}/n = \sigma_k^2/\Sigma$ the optimal allocation proportion for
arm k, and by $\lambda_{\min} = \min_{1 \le k \le K} \lambda_k$ the smallest such proportion.
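The static oracle allocation described above can be sketched in a few lines. This is only an illustration under the assumption that the variances are known and that rounding effects are ignored; the function name `optimal_static_allocation` and its return layout are hypothetical and not part of the paper.

```python
from typing import List, Tuple


def optimal_static_allocation(variances: List[float], n: int) -> Tuple[List[float], List[float], float]:
    """Return (lambdas, pulls, loss) for the oracle allocation with known variances.

    lambdas[k] is the proportion lambda_k = sigma_k^2 / Sigma,
    pulls[k] is the (unrounded) number of pulls T*_{k,n} = lambda_k * n,
    loss is the global loss L_n(A*) = Sigma / n.
    """
    if n <= 0:
        raise ValueError("the budget n must be positive")
    if any(v <= 0 for v in variances):
        raise ValueError("all variances must be strictly positive")
    total = sum(variances)                     # Sigma
    lambdas = [v / total for v in variances]   # lambda_k
    pulls = [lam * n for lam in lambdas]       # T*_{k,n}, rounding ignored
    loss = total / n                           # L_n(A*)
    return lambdas, pulls, loss


if __name__ == "__main__":
    lambdas, pulls, loss = optimal_static_allocation([4.0, 1.0], n=100)
    print(lambdas, pulls, loss)
```

With this allocation every arm has the same static loss $\sigma_k^2 / T^*_{k,n} = \Sigma/n$, which is why the maximum over arms cannot be lowered by moving pulls from one arm to another.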

In our setting where the variances of the arms are not known in advance, the exploration-exploitation 
trade-off is inevitable: an adaptive algorithm A should estimate the variances of the arms (exploration) at 
the same time as it tries to sample the arms proportionally to these estimates (exploitation). In order to 
measure how well the adaptive algorithm A performs, we compare its performance to that of the optimal 
allocation algorithm A*, which requires the knowledge of the variances of the arms. For this purpose, we 
define the notion of regret of an adaptive algorithm A as the difference between its loss Ln(A) and the 
optimal loss Ln(A*), i.e., 

$$R_n(A) = L_n(A) - L_n(A^*). \tag{4}$$


It is important to note that unlike the standard multi-armed bandit problems, we do not consider the notion 
of cumulative regret, and instead, use the excess-loss suffered by the algorithm at the end of the n rounds. 
This notion of regret is closely related to the pure exploration setting (e.g., [^[^). An interesting feature 
that is shared between this setting and the problem of active learning considered in this paper is that good
strategies should pull every arm a number of times that is linear in n. This is in contrast with the standard stochastic
bandit setting, in which the sub-optimal arms should be played only logarithmically often in n.

In [1], the authors provide an algorithm called GAFS-MAX and prove that its regret satisfies
$R_n(A_{GAFS-MAX}) = \tilde{O}(n^{-3/2})$ for a large enough budget n that depends on $\lambda_{\min}$. The constant hidden in the $\tilde{O}$ also depends on
$\lambda_{\min}$: the smaller $\lambda_{\min}$, the larger n needs to be for the $\tilde{O}(n^{-3/2})$ bound to hold, and the larger
the hidden constant.


3. Allocation Strategy Based on Chernoff-Hoeffding UCB

The first algorithm, called Chernoff-Hoeffding Allocation Strategy (CH-AS), is based on a Chernoff-Hoeffding
high-probability bound on the difference between the estimated and true variances of the arms.
Each arm is simply pulled proportionally to an upper-confidence-bound (UCB) on its variance. This algorithm
deals with the exploration-exploitation trade-off by pulling more often the arms with higher estimated
variances or higher uncertainty in these estimates.


^2 This equality does not hold when the number of pulls is random, e.g., in adaptive algorithms where the strategy depends
on the random observed samples.
Input: parameter $\delta$
Initialize: Pull each arm twice
for t = 2K + 1, ..., n do
    Compute the bound $B_{k,t}$ for each arm $1 \le k \le K$
    Pull an arm $k_t \in \arg\max_{1 \le k \le K} B_{k,t}$
end for
Output: $\hat\mu_{k,n}$ for all arms $1 \le k \le K$


Figure 1: The pseudo-code of the CH-AS algorithm, with $\hat\sigma^2_{k,t}$ computed as in Equation 5.


3.1. The CH-AS Algorithm 

The CH-AS algorithm $A_{CH}$ in Fig. 1 takes a confidence parameter $\delta$ as input and after n pulls returns
an empirical mean $\hat\mu_{k,n}$ for each arm k. At each time step t, i.e., after having pulled arm $k_t$, the algorithm
computes the empirical mean $\hat\mu_{k,t}$ and variance $\hat\sigma^2_{k,t}$ of each arm k as^3

$$\hat\mu_{k,t} = \frac{1}{T_{k,t}} \sum_{i=1}^{T_{k,t}} X_{k,i} \quad \text{and} \quad \hat\sigma^2_{k,t} = \frac{1}{T_{k,t}} \sum_{i=1}^{T_{k,t}} (X_{k,i} - \hat\mu_{k,t})^2, \tag{5}$$

where $X_{k,i}$ is the i-th sample of $\nu_k$ and $T_{k,t}$ is the number of pulls^4 allocated to arm k up to time t. After
pulling each arm twice (rounds t = 1 to 2K), from round t = 2K + 1 on, the algorithm computes the
values $B_{k,t}$ based on a Chernoff-Hoeffding bound on the variances of the arms:



and then pulls the arm $k_t$ with the largest $B_{k,t}$. This bound relies on the assumption that the distributions
$\{\nu_k\}_{k=1}^K$ are supported on [0,1].

Note that $\hat\mu_{k,t}$, $\hat\sigma^2_{k,t}$, $B_{k,t}$, $k_t$, and $T_{k,t}$ depend on the arm index (except for $k_t$), on the time step
$t \le n$, and also, either directly or indirectly (through the mechanism of the algorithm), on the
budget n and on $\delta$, which will be chosen as a function of the budget n. However, since we mostly consider
a fixed budget n and thus a fixed $\delta$, we keep this lighter notation.
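The allocation loop of CH-AS can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the confidence width is assumed to have the Hoeffding-style form `width_const * sqrt(log(1/delta) / (2 T))`, and both the function name `ch_as` and the parameter `width_const` (including its default value) are hypothetical choices made for this sketch. Arms are assumed to return samples in [0, 1].

```python
import math
import random
from typing import Callable, List, Tuple


def ch_as(arms: List[Callable[[], float]], n: int, delta: float,
          width_const: float = 5.0) -> Tuple[List[int], List[float]]:
    """Sketch of a CH-AS-style allocation; returns (pull counts, empirical means).

    width_const scales the assumed Hoeffding-style confidence width. Its value
    is an assumption of this sketch, not a constant taken from the text.
    """
    K = len(arms)
    if n < 2 * K:
        raise ValueError("the budget must allow two pulls per arm")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    counts = [0] * K
    sums = [0.0] * K      # running sum of samples per arm
    sq_sums = [0.0] * K   # running sum of squared samples per arm

    def pull(k: int) -> None:
        x = arms[k]()
        counts[k] += 1
        sums[k] += x
        sq_sums[k] += x * x

    for k in range(K):    # initialization: two pulls per arm
        pull(k)
        pull(k)

    for _ in range(2 * K, n):
        scores = []
        for k in range(K):
            t = counts[k]
            mean = sums[k] / t
            var = max(sq_sums[k] / t - mean * mean, 0.0)  # biased empirical variance
            ucb = var + width_const * math.sqrt(math.log(1.0 / delta) / (2.0 * t))
            scores.append(ucb / t)                        # UCB on variance over pulls
        pull(max(range(K), key=scores.__getitem__))

    means = [sums[k] / counts[k] for k in range(K)]
    return counts, means


if __name__ == "__main__":
    rng = random.Random(0)
    bernoulli = lambda: float(rng.random() < 0.5)  # variance 1/4
    constant = lambda: 0.5                         # variance 0
    print(ch_as([bernoulli, constant], n=2000, delta=0.01))
```

In the usage example the constant arm still receives a sizeable share of the budget, because its index is driven entirely by the confidence width; this is the over-pulling of low-variance arms that the regret discussion below is concerned with.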


3.2. Regret Bound and Discussion 

Before reporting a regret bound for the CH-AS algorithm, we first analyze its performance in targeting
the optimal allocation strategy in terms of the number of pulls. As will be discussed later, the distinction
between the performance in terms of the number of pulls and the regret will allow us to stress the potential
dependency of the regret on the distribution of the arms (see Section 4.3).

Lemma 1. Assume that the distributions $\{\nu_k\}_{k=1}^K$ are supported on [0,1] and let $\delta > 0$. Define the event



l<t<n 


^3 Notice that this is a biased estimator of the variance even if the number of pulls $T_{k,t}$ were not random.

"^An accurate notation for this should be n since the number of pulls at time t depends also on n. However, for the sake 
of concision, we write $T_{k,t}$.
The probability of this event is higher than or equal to $1 - 4nK\delta$. If n > 5K, the number of pulls $T_{k,n}$ by the
CH-AS algorithm launched with parameter $\delta$ satisfies, on this event,

$$-\lambda_k \left( \frac{12\sqrt{n \log(1/\delta)}}{\Sigma \lambda_{\min}^{3/2}} + 4K \right) \le T_{k,n} - T^*_{k,n} \le \frac{12\sqrt{n \log(1/\delta)}}{\Sigma \lambda_{\min}^{3/2}} + 4K, \tag{6}$$

for any arm $1 \le k \le K$.

Proof. The proof is reported in [Appendix A.^ □ 

We now show how the bound on the number of pulls translates into a regret bound for the CH-AS 
algorithm. 

Theorem 1. Assume that the distributions $\{\nu_k\}_{k=1}^K$ are supported on [0,1]. If the fixed (known in advance)
budget is such that n > 5K, the regret of Ach, when it runs with the parameter 5 = , is bounded as 


RniAcn) < 


39y^log(n) 2.9 x 10^ (logn)^/^ 

„ 3 / 2 \ 5/2 ^2 

/ V X'm I V-I 


. 11/2 

^min. 




(7) 


Proof. The proof is reported in Appendix A.3. It is mainly based on the last lemma and on the following
inequality (Equation A.13):


E 


r , , 2 . 

(Afc,n - < sup (-^jE[Tfe,„] . 


□ 


Remark 1. As discussed in Section 2, our objective is to design a sampling strategy capable of estimating the
mean values of the arms almost as accurately as the optimal allocation strategy, which
assumes that the variances of the arms are known. In fact, Theorem 1 shows that the CH-AS algorithm
provides a uniformly accurate estimation of the expected values of the arms with a regret $R_n(A_{CH})$ of order
$\tilde{O}(n^{-3/2})$. This regret rate is the same as the one for the GAFS-MAX algorithm in Antos et al. [1]. Note
also that this algorithm is efficient for a fixed horizon n, although it might be possible to change it so that
it is efficient for any horizon.

Remark 2. The bound displays an inverse dependency on the smallest optimal allocation proportion $\lambda_{\min}$.
As a result, the bound scales poorly when an arm has a very small variance relative to the others, i.e., $\sigma_k^2 \ll \Sigma$.
Note that GAFS-MAX (see [1]) also has a similar dependency on the inverse of $\lambda_{\min}$. Moreover, Theorem 1
holds for a budget n > 5K, whereas the regret bound of GAFS-MAX in [1] requires a condition $n \ge n_0$,
in which $n_0$ is a constant that scales with $1/\lambda_{\min}$. Finally, note that this UCB type of algorithm (CH-AS)
enables a much simpler regret analysis than that of GAFS-MAX.

Remark 3. It is clear from Lemma 1 that the inverse dependency on $\lambda_{\min}$ appears in the bound on the
number of pulls and then is propagated to the regret bound. We however believe that this dependency
is not an artifact of the analysis and is intrinsic in the performance of the algorithm. Let us consider a
two-arm problem with $\sigma_1^2 = 1/4$ and $\sigma_2^2 = 0$. The optimal allocation is $T^*_{1,n} = n - 1$, $T^*_{2,n} = 1$ (only one
sample is enough to estimate the mean of the second arm), and $\lambda_{\min} = 0$. In this case, the arguments
used in proving Theorem 1 do not hold anymore and the bound itself becomes vacuous. We conjecture that
the Chernoff-Hoeffding bound used in the upper-confidence term forces CH-AS to pull the arm with
zero variance at least $C n^{2/3}$ times, where C is a positive constant, with high probability, which results in
under-pulling the first arm by the same amount. As a result, the corresponding regret would have a rate of
$\tilde{O}(n^{-4/3})$ w.r.t. the budget n. This suggests that when $\lambda_{\min} = 0$ (or very small compared to 1/n) CH-AS is
still able to achieve a o(1/n) regret as the budget n increases, but with a slower rate w.r.t. the result proved
in Theorem 1.
Input: parameters $c_1$, $c_2$, $\delta$
Let a = V2ci log(c2/5) + 

V 6V ; T (i_5)^21og(2/«) 

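Initialize: Pull each arm twice
for t = 2K + 1, ..., n do
    Compute $B_{q,t} = \frac{1}{T_{q,t-1}} \left( \hat\sigma^2_{q,t-1} + 4a\, \hat\sigma_{q,t-1} \sqrt{\frac{\log(2/\delta)}{T_{q,t-1}}} + 4a^2\, \frac{\log(2/\delta)}{T_{q,t-1}} \right)$ for each arm $1 \le q \le K$
    Pull an arm $k_t \in \arg\max_{1 \le q \le K} B_{q,t}$
end for
Output: $\hat\mu_{q,n}$ for all the arms $1 \le q \le K$


Figure 2: The pseudo-code of the B-AS algorithm. The empirical variances $\hat\sigma^2_{k,t}$ are computed according to Equation 8.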

Finally, we notice that, for $\lambda_{\min} = 0$, GAFS-MAX is more efficient than CH-AS. In fact, it over-pulls the
arms with zero variance only by $\tilde{O}(n^{1/2})$ and has a regret of order $\tilde{O}(n^{-3/2})$. We will further study how the
regret of CH-AS changes with n in Section 5.1.
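The link between the number of misallocated pulls and the regret rate can be made explicit with a rough calculation. The sketch below treats the number of pulls as deterministic so that the static loss formula of Equation 3 applies; this is a simplifying assumption, since that formula need not hold exactly for adaptive strategies. The symbol m (the number of pulls taken away from the first arm) is introduced here for illustration only.

```latex
% Two-arm example with \sigma_2^2 = 0: the loss is driven by the first arm.
% If the first arm receives n - m pulls instead of (about) n, the excess loss is
\frac{\sigma_1^2}{n-m} - \frac{\sigma_1^2}{n}
  = \frac{\sigma_1^2 \, m}{n\,(n-m)}
  \approx \frac{\sigma_1^2 \, m}{n^2} \qquad (m \ll n).
% m of order n^{2/3} gives an excess loss of order n^{-4/3};
% m of order n^{1/2} gives an excess loss of order n^{-3/2}.
```

Under this heuristic, an over-pull of the zero-variance arm by an amount of order $n^{1/2}$ is compatible with the $n^{-3/2}$ regret rate, whereas any over-pull of larger order necessarily slows the rate down.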

As discussed in the previous remark, the reason for the poor performance in Lemma 1 for small $\lambda_{\min}$ can
be identified in the fact that Chernoff-Hoeffding's inequality is not tight for small-variance random variables.
In Section 4, we propose an algorithm based on a tighter inequality for small-variance random variables,
and prove that this algorithm under-pulls all the arms by at most $\tilde{O}(n^{1/2})$ without a dependency on $\lambda_{\min}$
(see Equations 10 and 11).

4. Allocation Strategy Based on Bernstein UCB 

In this section, we present another UCB-like algorithm, called Bernstein Allocation Strategy (B-AS),^5
based on a tighter variance confidence bound that enables us to improve the bound on $|T_{k,n} - T^*_{k,n}|$ by
removing the inverse dependency on $\lambda_{\min}$ (compare the bounds in Equations 10 and 11 to the one for CH-AS
in Equation 6). However, this result itself is not sufficient to derive a better regret bound than CH-AS.
This finding is interesting since it shows that even an adaptive algorithm which implements a strategy close
to the optimal allocation strategy may still incur a regret that scales poorly with the smallest proportion
$\lambda_{\min}$. We further investigate this issue by showing that the way the bound on the number of pulls translates
into a regret bound depends on the specific distributions of the arms. In fact, when the distributions
of the arms are Gaussian, we can exploit the property that the empirical variance $\hat\sigma^2_{k,t}$ is independent of
the empirical mean $\hat\mu_{k,t}$ and show that the regret of B-AS no longer depends on $1/\lambda_{\min}$. The numerical
simulations in Section 5 further illustrate how the full shape of the distributions (and not only their first
two moments) plays an important role in the regret of adaptive allocation algorithms.

4.1. The B-AS Algorithm

The algorithm is based on the use of a high-probability bound, reported in [l^ (a similar bound can 
be found in i), on the variance of each arm. Like in the previous section, the arm sampling strategy is 
determined by those bounds. The B-AS algorithm, $A_B$, is described in Figure 2. It requires three parameters
as input (see Remark 2 in Subsection 4.3 for a discussion on how to reduce the number of parameters from
three to one): $c_1$ and $c_2$, which are related to the shape of the distributions (see Assumption 1), and $\delta$, which
defines the confidence level of the bound. The amount of exploration of the algorithm can be adapted by


^5 The original Bernstein inequality refines the Chernoff-Hoeffding inequality by introducing the variance of the random
variable in the confidence bound. This inequality has been later adapted to the case where the actual variance is unknown and it 
can be replaced by an empirical estimate of the variance (see i)- In [T^ a similar result is obtained for the variance, where the 
confidence bound displays a dependency on the empirical estimate of the variance, thus we refer to this algorithm as Bernstein 
Allocation Strategy. Furthermore, we notice that the inequality derived in does not follow from a trivial application of 
Chernoff-Hoeffding, since it provides a concentration inequality for the standard deviation which is not an average of i.i.d. 
random variables but the square root of an average of squared variables. 
properly tuning these parameters. The algorithm is similar to CH-AS except that for each arm, the bound
$B_{q,t}$ is computed as

$$B_{q,t} = \frac{1}{T_{q,t-1}} \left( \hat\sigma^2_{q,t-1} + 4a\, \hat\sigma_{q,t-1} \sqrt{\frac{\log(2/\delta)}{T_{q,t-1}}} + 4a^2\, \frac{\log(2/\delta)}{T_{q,t-1}} \right),$$


■where a = \/ 2 ci log(c 2 /( 5 ) + ^^ 1/2 ancd 

V i 2 / ; (1-5)72 log(2/5) ’ ^ 

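$$\hat\mu_{k,t} = \frac{1}{T_{k,t}} \sum_{i=1}^{T_{k,t}} X_{k,i} \quad \text{and} \quad \hat\sigma^2_{k,t} = \frac{1}{T_{k,t} - 1} \sum_{i=1}^{T_{k,t}} (X_{k,i} - \hat\mu_{k,t})^2. \tag{8}$$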

Note that $\hat\mu_{k,t}$, $\hat\sigma^2_{k,t}$, $B_{k,t}$, $k_t$, and $T_{k,t}$ depend on the arm index (except for $k_t$), on the time step
$t \le n$, and also, either directly or indirectly (through the mechanism of the algorithm), on the
budget n, on $\delta$, which will be chosen as a function of the budget n, and on $c_1$ and $c_2$. However, since
we mostly consider a fixed budget n, and thus a fixed $\delta$, and fixed $c_1$, $c_2$, we keep this lighter
notation.
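A sketch of the B-AS index and allocation loop is given below. It assumes an index of the form $(\hat\sigma + 2a\sqrt{\log(2/\delta)/T})^2 / T$, which expands to a three-term bound with an empirical-variance term, a cross term, and a term in $a^2 \log(2/\delta)/T$. To keep the sketch short, the product $a\sqrt{\log(2/\delta)}$ is treated as a single tuning value `b`; this simplification and the names `bas_index` and `b_as` are assumptions of the sketch, not the authors' code.

```python
import math
import random
from typing import Callable, List


def bas_index(var_hat: float, pulls: int, b: float) -> float:
    """Index (var + 4 b sd / sqrt(T) + 4 b^2 / T) / T, with b = a * sqrt(log(2/delta))."""
    sd_hat = math.sqrt(max(var_hat, 0.0))
    return (var_hat + 4.0 * b * sd_hat / math.sqrt(pulls) + 4.0 * b * b / pulls) / pulls


def b_as(arms: List[Callable[[], float]], n: int, b: float) -> List[int]:
    """Sketch of a B-AS-style allocation loop; returns the pull counts per arm."""
    K = len(arms)
    if n < 2 * K:
        raise ValueError("the budget must allow two pulls per arm")
    counts = [0] * K
    sums = [0.0] * K      # running sum of samples per arm
    sq_sums = [0.0] * K   # running sum of squared samples per arm

    def pull(k: int) -> None:
        x = arms[k]()
        counts[k] += 1
        sums[k] += x
        sq_sums[k] += x * x

    for k in range(K):    # initialization: two pulls per arm
        pull(k)
        pull(k)

    for _ in range(2 * K, n):
        scores = []
        for k in range(K):
            t = counts[k]
            mean = sums[k] / t
            # Unbiased empirical variance (normalization by t - 1).
            var = max((sq_sums[k] - t * mean * mean) / (t - 1), 0.0)
            scores.append(bas_index(var, t, b))
        pull(max(range(K), key=scores.__getitem__))
    return counts


if __name__ == "__main__":
    rng = random.Random(1)
    arms = [lambda: rng.gauss(0.0, 2.0), lambda: rng.gauss(0.0, 1.0)]
    print(b_as(arms, n=3000, b=1.0))
```

Because the confidence width multiplies the empirical standard deviation, the bonus shrinks for arms whose samples show little spread, which is the qualitative difference from a Hoeffding-style width that ignores the variance.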


4.2. Regret Bound and Discussion

The B-AS algorithm is designed to overcome the limitations of CH-AS, especially in the case of arms
with different variances. Here we consider a more general assumption than in the previous section, namely
that the distributions are sub-Gaussian.


Assumption 1 (Sub-Gaussian distributions). There exist $c_1, c_2 > 0$ such that for all $1 \le k \le K$ and any
$\epsilon > 0$,

$$P_{X \sim \nu_k}\left[\, |X - \mu_k| \ge \epsilon \,\right] \le c_2 \exp(-\epsilon^2 / c_1). \tag{9}$$

This assumption holds for the Gaussian distribution, and more generally for any distribution whose tail
is lighter than a Gaussian's. It thus holds for bounded random variables. For example, if $X \in [0,1]$, then
the assumption holds with, e.g., $c_1 = 1$ and $c_2 = e$.
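The bounded example can be checked directly. The following short verification is a sketch that assumes the tail condition is read as a bound on $P[|X - \mu_k| \ge \epsilon]$ and uses only the fact that a variable supported on [0,1] deviates from its mean by at most 1.

```latex
% X \in [0,1] implies \mu_k \in [0,1] and |X - \mu_k| \le 1 almost surely.
% Case \epsilon > 1: the deviation event is impossible.
P\big[\, |X - \mu_k| \ge \epsilon \,\big] = 0 \le e \cdot \exp(-\epsilon^2).
% Case 0 < \epsilon \le 1: the right-hand side is at least 1.
e \cdot \exp(-\epsilon^2) \ge e \cdot e^{-1} = 1 \ge P\big[\, |X - \mu_k| \ge \epsilon \,\big].
```

Both cases give the required inequality with $c_1 = 1$ and $c_2 = e$, although these constants are far from tight for variables with small variance.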

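We first state a bound in Lemma 2 on the difference between the number of pulls suggested by B-AS
and the optimal allocation strategy.

Lemma 2. Let Assumption 1 hold for $c_1, c_2 \ge 1$ and let $0 < \delta < 2/e$. Define the event

$$\xi_{K,n}(\delta) = \bigcap_{1 \le k \le K,\; 2 \le t \le n} \left\{ \left| \sqrt{\frac{1}{t-1} \sum_{i=1}^{t} \Big( X_{k,i} - \frac{1}{t} \sum_{j=1}^{t} X_{k,j} \Big)^2} - \sigma_k \right| \le 2a \sqrt{\frac{\log(2/\delta)}{t}} \right\},$$

where a is the quantity defined in Section 4.1. The probability of $\xi_{K,n}(\delta)$ is higher than $1 - 2nK\delta$.

When we run the B-AS algorithm with parameters $c_1 \ge 1$, $c_2 \ge 1$, and $\delta$, and budget n > 5K, on $\xi_{K,n}(\delta)$
and for each arm $1 \le k \le K$, we have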


Tku > - KXk 


16a^log{2/S) 2av^log(2/^)\ ^ ^ 1/4 




ciS) / 




and 


Tkn < + K 


l6a^\ogi2/d) ^ 2a7log(2/<5)\ ^ ^i /4 




ciS) j 

Proof. The proof is reported in [Appendix B .T and [Appendix B.2 




where c(5) = _ °\/3i°s(2/^) _ 

^ '' v^(VS-H3a7log(2/5)) ' 


( 10 ) 


( 11 ) 


□ 


^6 Unlike in Equation 5, here we use the unbiased estimator of the variance.
Remark. Unlike the bounds for CH-AS in Lemma 1, B-AS allocates the pulls on the arms so that, on the
event $\xi_{K,n}(\delta)$, the bound on the difference between $T_{k,n}$ and $T^*_{k,n}$ is now independent from $\lambda_{\min}$, while
it preserves a $\sqrt{n}$ dependency on the budget. In practice, this difference may correspond to a significant
improvement. In fact, for any finite budget n, if the arms are such that the term depending on $\lambda_{\min}$ becomes
the leading term in the bound in Lemma 1, then we can expect B-AS to outperform CH-AS (see also Remark
3 of Section 3.2 for further discussion of the performance of CH-AS for very small $\lambda_{\min}$). Another interesting
aspect of the previous lemma is that the deviation in the lower bound in Equation 10 can be written as $C \lambda_k \sqrt{n}$ (where C > 0
does not depend on $\lambda_k$). This implies that as the allocation ratio $\lambda_k$ decreases (i.e., arm k should not be pulled
much), the difference between $T_{k,n}$ and $T^*_{k,n}$ decreases as well. This is not the case in the upper bound,
where the difference between $T_{k,n}$ and $T^*_{k,n}$ does not have any linear dependency on $\lambda_k$. This asymmetry
between lower and upper bound is the main reason why the final regret bound of B-AS actually displays an
inverse dependency on $\lambda_{\min}$, as shown in Theorem 2.

Theorem 2. Assume that all the distributions $\{\nu_k\}_{k=1}^K$ are sub-Gaussian with parameters $c_1$ and $c_2$. If
the fixed (known in advance) budget is such that n > 5K, the regret of $A_B$, when it runs with parameters
Cl > I, C 2 > 1, and S = is bounded as 


RniAs) < 


76400ci(c2 -|- l)Ar^(logn) 
A ' 


-h 


/(logn)^N 

V j 


Proof. The proof is reported in [Appendix B .3} 


□ 


Note again that this algorithm is efficient for a fixed horizon n, although it might be possible to change
it so that it is efficient for any horizon.

Similar to Theorem 1, the bound on the number of pulls translates into a regret bound through Equation A.13,
reported in Appendix A.3. Note that in order to remove the dependency on $\lambda_{\min}$, a symmetric
bound $|T_{k,n} - T^*_{k,n}| \le \lambda_k \tilde{O}(\sqrt{n})$ is needed. While the lower bound in Equation 10 already decreases with
$\lambda_k$, the upper bound scales with $\tilde{O}(\sqrt{n})$. Whether there exists an algorithm with a tighter upper bound
scaling with $\lambda_k$ is still an open question. Nonetheless, in the next section, we show that an improved bound
on the loss can be achieved in the special case of Gaussian distributions, which leads to a regret bound
without the dependency on $\lambda_{\min}$.


4.3. Regret for Gaussian Distributions

In the case of Gaussian distributions, the bound on the loss of Equation A.13 can be improved using the
following lemma.

Lemma 3. Let $k \le K$. Assume that the distribution $\nu_k$ is Gaussian (and independent of all other distributions
$\{\nu_{k'}\}_{k' \ne k}$). Then the loss for arm k of algorithms CH-AS or B-AS satisfies

$$L_{k,n} = E\left[(\hat\mu_{k,n} - \mu_k)^2\right] = \sigma_k^2\, E\left[\frac{1}{T_{k,n}}\right]. \tag{12}$$

Proof. The proof is reported in Appendix C. □

Remark. Note that the loss in Equation 12 does not require any upper bound on $T_{k,n}$. It is actually similar
to the case of deterministic allocation. When $T_{k,n}$ is a deterministic number of pulls, the corresponding
loss resulting from pulling arm k, $T_{k,n}$ times, is $L_{k,n} = \sigma_k^2 / T_{k,n}$. In general, when $T_{k,n}$ is a random variable
depending on the empirical variances (like in our adaptive algorithms CH-AS and B-AS), we have

$$E\left[(\hat\mu_{k,n} - \mu_k)^2\right] = \sum_{t=1}^{n} E\left[(\hat\mu_{k,n} - \mu_k)^2 \,\middle|\, T_{k,n} = t\right] P[T_{k,n} = t],$$

which might be different from $\sigma_k^2\, E[1/T_{k,n}]$. In fact, the empirical average $\hat\mu_{k,n}$ depends on $T_{k,n}$ through
$\hat\sigma_{k,n}$, and $E[(\hat\mu_{k,n} - \mu_k)^2 \mid T_{k,n} = t]$ might not be equal to $\sigma_k^2 / t$. However, Gaussian distributions have
the property that for any fixed-size sample, the empirical mean is independent from the empirical variance,
and this enables us to prove Lemma 3, which holds for both the CH-AS and the B-AS algorithms.

We now report a regret bound in the case of Gaussian distributions. Note that in this case Assumption 1
holds with $c_1 = 2\Sigma$ and $c_2 = 1$.^7

Theorem 3. Assume that all the distributions $\{\nu_k\}_{k=1}^K$ are Gaussian and that an upper bound $\bar\Sigma \ge 1/2$ on
$\Sigma$ is known. If the budget is known in advance and such that n > 5K, the B-AS algorithm launched with
parameters ci = 2S, C 2 = 1, and S = has the following regret bound 

RniAs) < ^ 3 !*^ ^ K'^ilognf . (13) 

Proof. The proof is reported in Appendix C. □

Remark 1. In the case of Gaussian distributions, the regret bound for B-AS has the rate $\tilde{O}(n^{-3/2})$ without
dependency on $\lambda_{\min}$, which represents a significant improvement over the regret bounds of the CH-AS and
GAFS-MAX algorithms.

Remark 2. In practice, there is no need to tune the three parameters $c_1$, $c_2$, and $\delta$ separately. In fact,
it is enough to tune the algorithm for a single parameter $a\sqrt{\log(2/\delta)}$ (see Figure 2). Using the proof of
Theorem 2 and the optimized value of $\delta$, as well as the fact that for Gaussian distributions $c_1 \le 2\Sigma$ and
$c_2 \le 1$, it is possible to show that choosing a as in Theorem 3 means that $a = O((\bar\Sigma \log n)^{1/2})$, where $\bar\Sigma$
is an upper bound on the value of $\Sigma$. This is a reasonable thing to do whenever a rough estimate of the
magnitude of the variances is available.

5. Experimental Results 

5.1. CH-AS, B-AS, and GAFS-MAX with Gaussian Arms 

In this section, we compare the performance of CH-AS, B-AS, and GAFS-MAX on a two-armed problem
with Gaussian distributions $\nu_1 = \mathcal{N}(0, \sigma_1^2 = 4)$ and $\nu_2 = \mathcal{N}(0, \sigma_2^2 = 1)$ (note that $\lambda_{\min} = 1/5$). Figure 3 (left)
shows the rescaled regret, $n^{3/2} R_n$, for the three algorithms averaged over 50,000 runs. The results indicate
that while the rescaled regret is almost constant with respect to n in B-AS and GAFS-MAX, it increases
for small (relative to $\lambda_{\min}^{-1}$) values of n in CH-AS.

The robust behavior of B-AS when the distributions of the arms are Gaussian may be easily explained by
the bound of Theorem 3 (Equation 13). Note though that this experiment seems to imply that there is no
additional dependency on log(n): the logarithmic factor could be just an artifact of the proof. The initial increase in the CH-AS
curve is also consistent with the bound of Theorem 1 (Equation 7). As discussed in Remark 3 of Section 3.2,
we conjecture that the regret bound for CH-AS is of the form $R_n \le \min\{\lambda_{\min}^{-5/2} \tilde{O}(n^{-3/2}), \tilde{O}(n^{-4/3})\}$, and thus,
the algorithm's regret is bounded as $\tilde{O}(n^{-4/3})$ and $\lambda_{\min}^{-5/2} \tilde{O}(n^{-3/2})$ for small and large (relative to $\lambda_{\min}^{-1}$)
values of n, respectively. It is important to note that the regret bound of CH-AS depends on the arms'
distributions only through the variances of the distributions, as shown in Theorem 1. Finally, the curve for
GAFS-MAX is very close to the curve for B-AS. For this reason, we believe that it could be possible to
improve the GAFS-MAX analysis by using refined concentration inequalities for the standard deviation as
done in B-AS. This might also remove the inverse dependency on $\lambda_{\min}$ and provide a regret bound similar
to B-AS in the case of Gaussian distributions.


^7 Note that for a single Gaussian distribution $c_1 = 2\sigma^2$, where $\sigma^2$ is the variance of the distribution. Here we use $c_1 = 2\Sigma$
in order for the assumption to be satisfied for all the K distributions simultaneously.
Figure 3: (left) The rescaled regret of CH-AS, B-AS, and GAFS-MAX algorithms on a two-armed problem, where the distributions
of the arms are Gaussian. (right) The rescaled regret of B-AS for two bandit problems, one with two Gaussian arms
and one with a Gaussian and a Rademacher arm.


5.2. B-AS with Non-Gaussian Arms 

In Section 4.3 we showed that when the arms have Gaussian distributions, the regret bound of the B-AS
algorithm no longer depends on $\lambda_{\min}$. We also discussed why we conjecture that it is not possible to remove
this dependency for general distributions unless a tighter upper bound on the number of pulls can be derived.
Although we do not yet have a lower bound on the regret showing the dependency on $\lambda_{\min}$, i.e., that the regret
might depend on the shape of the distribution, in this section we show that for Rademacher distributions,
the regret of B-AS behaves in a different way than for Gaussian distributions with the same variance.

As discussed in Section 4.3, the property of the Gaussian distribution that allows us to remove the $\lambda_{\min}$
dependency in the regret bound of B-AS is that for any sample of fixed size drawn i.i.d. from a Gaussian
distribution, the corresponding empirical mean and empirical variance are independent. For other distributions,
e.g., the Rademacher distribution, the quantities $(\hat\mu_{k,n} - \mu_k)^2$ and $\hat\sigma^2_{k,n}$ are however
conditionally negatively correlated given $T_{k,n}$. In the case of the Rademacher distribution, the loss
$(\hat\mu_{k,t} - \mu_k)^2$ is equal to $\hat\mu^2_{k,t}$ (since $\mu_k = 0$) and, since $X_{k,i}^2 = 1$, we have

$$\hat\sigma^2_{k,t} = \frac{1}{T_{k,t}-1}\sum_{i=1}^{T_{k,t}} \big(X_{k,i} - \hat\mu_{k,t}\big)^2 = \frac{T_{k,t}}{T_{k,t}-1}\big(1 - \hat\mu^2_{k,t}\big).$$

As a result, the larger $\hat\sigma^2_{k,t}$ is, the smaller $\hat\mu^2_{k,t}$ is. We
know that the allocation strategies in CH-AS, B-AS, and GAFS-MAX are based on the empirical variance,
which is used as a substitute for the true variance. As a result, the larger $\hat\sigma^2_{k,t}$ is, the more often arm $k$ is
pulled. For the Rademacher distribution, this means that an arm is pulled more than its optimal allocation
exactly when its mean is accurately estimated (the loss is small), and less than its optimal allocation when its
mean is poorly estimated. This may result in a poor estimation of the arm,
and thus negatively affect the regret of the algorithm.
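The negative correlation described above can be checked numerically. The following is a minimal sketch, not part of the original experiments: it assumes the unbiased empirical variance (normalization by $T-1$) and uses hypothetical helper names (`rademacher_stats`, `pearson`). For Rademacher samples the empirical variance is an exact decreasing affine function of the squared empirical mean, so the sample correlation between the loss and the empirical variance should be $-1$ up to rounding.

```python
import random


def rademacher_stats(num_samples, rng):
    """Draw i.i.d. Rademacher samples; return (empirical mean, unbiased empirical variance)."""
    xs = [rng.choice((-1, 1)) for _ in range(num_samples)]
    mean = sum(xs) / num_samples
    var = sum((x - mean) ** 2 for x in xs) / (num_samples - 1)
    return mean, var


def pearson(us, vs):
    """Plain sample Pearson correlation coefficient."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = sum((u - mu) ** 2 for u in us) ** 0.5
    sv = sum((v - mv) ** 2 for v in vs) ** 0.5
    return cov / (su * sv)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
T = 20                  # number of pulls of the arm (illustrative value)
losses, variances = [], []
identity_gap = 0.0
for _ in range(2000):
    m, v = rademacher_stats(T, rng)
    losses.append(m ** 2)  # the loss (mean estimate minus true mean 0), squared
    variances.append(v)
    # gap to the closed form T/(T-1) * (1 - mean^2)
    identity_gap = max(identity_gap, abs(v - T / (T - 1) * (1 - m ** 2)))

corr = pearson(losses, variances)
print(identity_gap, corr)
```

A small loss therefore always comes with a large empirical variance, which is the over-pulling mechanism discussed in the text.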

In the experiments of this section, we use B-AS in two different bandit problems: one with two Gaussian
arms $\nu_1 = \mathcal{N}(0, \sigma_1^2)$ (with $\sigma_1^2 \geq 1$) and $\nu_2 = \mathcal{N}(0, 1)$, and one with a Gaussian arm
$\nu_1 = \mathcal{N}(0, \sigma_1^2)$ (with $\sigma_1^2 \geq 1$) and a Rademacher arm $\nu_2$. Note that in both cases
$\lambda_{\min} = \lambda_2 = 1/(1 + \sigma_1^2)$. Figure 3 (right) shows the
rescaled regret ($n^{3/2} R_n$) of the B-AS algorithm as a function of $\lambda_{\min}^{-1}$ for $n = 1000$. While the rescaled
regret of B-AS is constant in the first problem, it increases with $\sigma_1^2$ in the second one. This leads us to
the conclusion that the shape of the distributions of the arms has an impact on the regret of the
B-AS algorithm. In fact, as explained above, this behavior might be due to the poor approximation of the Rademacher
arm, which is over-pulled exactly when its estimated mean is accurate. This result seems to illustrate
the fact that in this active learning problem (where the goal is to estimate the mean values of the arms), the
performance of the algorithms that rely on the empirical variance (e.g., CH-AS, B-AS, and GAFS-MAX)
depends on the shape of the distributions, and not only on their variances. This may be surprising since,
according to the central limit theorem, the distribution of the empirical mean should tend to a Gaussian.
However, it seems that what is important is not the distribution of the empirical mean or variance, but the
correlation of these two quantities. This is why we believe that any algorithm that is based on empirical
standard deviations might be subject to the same problem. However, at the moment no fully satisfactory
theoretical analysis is available on this point.

A random variable $X$ is Rademacher if $X \in \{-1, 1\}$ and takes the values $-1$ and $1$ with equal probability.
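For reference, the smallest optimal proportion $\lambda_{\min}$ in the two-armed problems of this section can be computed directly from the variances. The sketch below is only illustrative: the function name `optimal_proportions` is an assumption, and it simply implements the allocation proportional to the variances, $\lambda_k = \sigma_k^2 / \Sigma$ with $\Sigma = \sum_k \sigma_k^2$.

```python
def optimal_proportions(variances):
    """Proportions lambda_k = sigma_k^2 / Sigma of the variance-proportional allocation."""
    total = sum(variances)
    return [v / total for v in variances]


# Two arms: variance sigma_1^2 for the first arm, variance 1 for the second
# (a Rademacher arm and a standard Gaussian arm both have variance 1).
for sigma1_sq in (1.0, 4.0, 9.0):
    lam = optimal_proportions([sigma1_sq, 1.0])
    lam_min = min(lam)
    # lam_min coincides with 1 / (1 + sigma_1^2); 1 / lam_min is the x-axis quantity
    print(sigma1_sq, lam, 1.0 / lam_min)
```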

6. Conclusions and Open Questions 

In this paper, we studied the problem of adaptive allocation for finding a uniformly good estimation
of the mean values of $K$ independent distributions. This problem was first studied by Antos et al. [1].
Although the algorithm proposed in [1] achieves a small regret of order $\tilde{O}(n^{-3/2})$, it displays an inverse
dependency on the smallest proportion $\lambda_{\min}$. In this paper, we first introduced a novel class of algorithms based
on upper-confidence-bounds on the (unknown) variances of the arms, and analyzed two such algorithms:
the Chernoff-Hoeffding allocation strategy (CH-AS) and the Bernstein allocation strategy (B-AS). For CH-AS we
derived a regret similar to [1], scaling as $\tilde{O}(n^{-3/2})$ and with an inverse dependence on $\lambda_{\min}$. Unlike in [1], this
result holds for any $n \geq 5K$ and the constants in the bound are made explicit. We then introduced a more
refined algorithm, B-AS, whose regret bound does not depend on $\lambda_{\min}$ for Gaussian arms. Nonetheless, its
general regret bound still depends on $\lambda_{\min}$. We showed that this dependency may be related to the specific
distributions of the arms and can be removed in the case of Gaussian distributions. Finally, we reported
numerical simulations supporting the idea that the shape of the distributions has an impact on the
performance of the allocation strategies.

This work opens a number of questions. 

• Distribution dependency. An open question is to what extent the result of B-AS in the case of
the Gaussian distribution can be extended to more general families of distributions. As illustrated
in the case of Rademacher, the correlation between the empirical mean and variance may cause the
algorithm to over-pull arms even when their estimation is accurate, thus incurring a large regret. On
the other hand, if the distributions of the arms are Gaussian, their empirical mean and variance are
independent and allocation algorithms such as B-AS achieve a better regret. Further investigation
is needed to identify whether this result can be extended to other distributions.

• Lower bound. The results of Sections 4.3 and 5.2 suggest that the dependency on the distributions
of the arms could be intrinsic to the allocation problem. If this is the case, it should be possible to
derive a lower bound for this problem showing such dependency (a lower bound with dependency on
$\lambda_{\min}$). As a matter of fact, no lower bounds are available for this problem and it would be interesting
to provide some.
Appendix A. Regret Bound for the CH-AS Algorithm

Let us consider $n > 0$ and $\delta > 0$ (that can be a function of $n$) fixed. We consider all the quantities
in the definition of algorithm CH-AS as defined with respect to these fixed $n$, $\delta$, and use the
abbreviated notations $\hat\mu_{k,t}$, $\hat\sigma_{k,t}$, $B_{k,t}$, $k_t$, and $T_{k,t}$.


Appendix A.1. Basic Tools

Since the basic tools used in the proof of Theorem 1 are similar to those used in the work by Antos et al.
[1], we begin this section by restating two results from that paper. Let $\xi$ be the event

$$\xi = \xi_{K,n}(\delta) = \bigcap_{1 \le k \le K,\ 1 \le t \le n} \left\{ \left| \left( \frac{1}{t}\sum_{i=1}^{t} X_{k,i}^2 - \Big(\frac{1}{t}\sum_{i=1}^{t} X_{k,i}\Big)^2 \right) - \sigma_k^2 \right| \le 3\sqrt{\frac{\log(1/\delta)}{2t}} \right\}. \tag{A.1}$$

Note that the first term in the absolute value in Equation (A.1) is the sample variance of arm $k$ computed
as in Equation (5) for $t$ samples. It can be shown using Hoeffding's inequality (see Hoeffding [1^) that
$\Pr[\xi] \geq 1 - 4nK\delta$, and this is shown by directly reusing the elements of the proof of Lemma 2 in Antos et al.
[1]. The event $\xi$ plays an important role in the proofs of this section and several statements will be proved
on this event. We now report the following proposition, which is analogous to Lemma 2 in Antos et al. [1].

Proposition 1. For any $k = 1, \dots, K$ and $t = 1, \dots, n$, let $\{X_{k,i}\}_{i=1,\dots,T_{k,t}}$, with $T_{k,t} \in \{1, \dots, t\}$, be i.i.d. random
variables bounded in $[0,1]$ from the distribution $\nu_k$ with variance $\sigma_k^2$, and let $\hat\sigma^2_{k,t}$ be the sample variance computed
as in Equation (5). Then the following statement holds on the event $\xi$:

$$\big|\hat\sigma^2_{k,t} - \sigma_k^2\big| \le 3\sqrt{\frac{\log(1/\delta)}{2T_{k,t}}}. \tag{A.2}$$


We also need to draw a connection between the allocation and stopping time problems. Thus, we report
the following proposition, which is Lemma 10 in Antos et al. [1].

Proposition 2. Let $\{\mathcal{F}_t\}_{t=1,\dots,n}$ be a filtration and $\{X_t\}_{t=1,\dots,n}$ be an $\mathcal{F}_t$-adapted sequence of i.i.d. random
variables with finite expectation $\mu$ and variance $\sigma^2$. Assume that $\mathcal{F}_t$ and $\sigma(\{X_s : s \ge t+1\})$ are independent
for any $t \le n$, and let $T$ ($\le n$) be a stopping time with respect to $\mathcal{F}_t$. Then

$$\mathbb{E}\left[\Big(\sum_{i=1}^{T} X_i - T\mu\Big)^2\right] = \mathbb{E}[T]\,\sigma^2. \tag{A.3}$$
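The identity of Proposition 2 can be verified exactly on a toy example. The sketch below is illustrative only and its setting is an assumption: the $X_i$ take values $\pm 1$, the stopping time $T$ is the first time the partial sum reaches $+1$ (capped at $n$), and all $2^n$ paths are enumerated with exact rational arithmetic. The function name `wald_second_moment` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def wald_second_moment(n, p_up=Fraction(1, 2)):
    """Return (E[(S_T - T*mu)^2], E[T]*sigma^2) by exact enumeration.

    X_i = +1 with probability p_up and -1 otherwise; T is the first time the
    partial sum equals +1, or n if that never happens (a bounded stopping time).
    """
    mu = 2 * p_up - 1      # mean of a single +/-1 variable
    sigma2 = 1 - mu ** 2   # its variance, since X_i^2 = 1
    lhs = Fraction(0)
    expected_T = Fraction(0)
    for path in product((1, -1), repeat=n):
        prob = Fraction(1)
        for x in path:
            prob *= p_up if x == 1 else 1 - p_up
        s, T = 0, n
        for t, x in enumerate(path, start=1):
            s += x
            if s == 1:     # stopping rule depends only on the past samples
                T = t
                break
        lhs += prob * (s - T * mu) ** 2
        expected_T += prob * T
    return lhs, expected_T * sigma2


lhs, rhs = wald_second_moment(8)
print(lhs, rhs)
```

Both sides agree exactly, including for a biased coin such as `p_up=Fraction(1, 3)`, because the stopping rule only looks at samples already observed.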


Appendix A.2. Allocation Performance 

In this subsection, we first provide the proof of Lemma 1 and then use the result in the next subsection
to prove Theorem 1.

Proof of Lemma 1. The proof consists of the following three main steps. We assume that $\xi$ holds until the
end of this proof.

Step 1. Mechanism of the algorithm. Recall the definition of the upper bound used in $\mathcal{A}_{CH}$ at a time
$t + 1 > 2K$:

$$B_{q,t+1} = \frac{1}{T_{q,t}} \left( \hat\sigma^2_{q,t} + 3\sqrt{\frac{\log(1/\delta)}{2T_{q,t}}} \right), \qquad 1 \le q \le K.$$

From Proposition 1 we obtain the following upper and lower bounds for $B_{q,t+1}$ on the event $\xi$:

$$\frac{\sigma_q^2}{T_{q,t}} \le B_{q,t+1} \le \frac{1}{T_{q,t}} \left( \sigma_q^2 + 6\sqrt{\frac{\log(1/\delta)}{2T_{q,t}}} \right). \tag{A.4}$$

Note that as $n \geq 4K$, there is at least one arm $k$ that is pulled after the initialization. Let $k$ be a given such
arm and $t + 1 > 2K$ be the time when it is pulled for the last time, i.e., $T_{k,t} = T_{k,n} - 1$ and $T_{k,t+1} = T_{k,n}$.
Since $\mathcal{A}_{CH}$ chooses to pull arm $k$ at time $t + 1$, for any arm $p$, we have

$$B_{p,t+1} \le B_{k,t+1}. \tag{A.5}$$

From Equation (A.4) and the fact that $T_{k,t} = T_{k,n} - 1$, we obtain

$$B_{k,t+1} \le \frac{1}{T_{k,t}} \left( \sigma_k^2 + 6\sqrt{\frac{\log(1/\delta)}{2T_{k,t}}} \right) = \frac{1}{T_{k,n} - 1} \left( \sigma_k^2 + 6\sqrt{\frac{\log(1/\delta)}{2(T_{k,n} - 1)}} \right). \tag{A.6}$$

Using the lower bound in Equation (A.4) and the fact that $T_{p,t} \le T_{p,n}$, we may lower bound $B_{p,t+1}$ as

$$B_{p,t+1} \ge \frac{\sigma_p^2}{T_{p,t}} \ge \frac{\sigma_p^2}{T_{p,n}}. \tag{A.7}$$

Combining Equations (A.5), (A.6), and (A.7), we obtain

$$\frac{\sigma_p^2}{T_{p,n}} \le \frac{1}{T_{k,n} - 1} \left( \sigma_k^2 + 6\sqrt{\frac{\log(1/\delta)}{2(T_{k,n} - 1)}} \right). \tag{A.8}$$

Note that at this point there is no dependency on $t$, and thus Equation (A.8) holds on the event $\xi$ for any
arm $k$ that is pulled at least once after the initialization, and for any arm $p$.
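The mechanism analyzed in Step 1 can be sketched as a short simulation. This is a hedged illustration rather than the authors' implementation: the index form (empirical variance plus a confidence term, divided by the number of pulls), the constant 3 in the confidence term, the two initial pulls per arm, and the $1/t$-normalized sample variance are assumptions read off the surrounding analysis, and the names `ch_as_index` and `ch_as` are hypothetical.

```python
import math
import random


def ch_as_index(xs, delta):
    """Assumed CH-AS index of one arm: (sample variance + confidence width) / pulls."""
    t = len(xs)
    mean = sum(xs) / t
    var = sum(x * x for x in xs) / t - mean ** 2  # 1/t-normalized sample variance
    return (var + 3.0 * math.sqrt(math.log(1.0 / delta) / (2.0 * t))) / t


def ch_as(arms, n, delta):
    """Pull each arm twice, then always pull the arm with the largest index.

    arms: zero-argument callables returning samples in [0, 1].
    Returns (number of pulls per arm, empirical mean per arm).
    """
    samples = [[arm(), arm()] for arm in arms]  # initialization phase: 2K pulls
    for _ in range(n - 2 * len(arms)):
        indices = [ch_as_index(xs, delta) for xs in samples]
        k = max(range(len(arms)), key=indices.__getitem__)
        samples[k].append(arms[k]())
    return [len(xs) for xs in samples], [sum(xs) / len(xs) for xs in samples]


rng = random.Random(1)  # fixed seed, chosen arbitrarily
# Two Bernoulli arms with very different variances (illustrative choice).
arms = [lambda: float(rng.random() < 0.5), lambda: float(rng.random() < 0.05)]
n = 200
pulls, means = ch_as(arms, n, delta=n ** -2.5)
print(pulls, means)
```

Because an arm's index shrinks each time the arm is pulled, the rule keeps the indices of all arms roughly balanced, which is what Equation (A.5) exploits at the last pull of an arm.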


Step 2. Lower bound on $T_{p,n}$. If an arm $q$ is under-pulled without taking into account the initialization
phase, i.e., $T_{q,n} - 2 < \lambda_q(n - 2K)$, then from the constraint $\sum_k (T_{k,n} - 2) = n - 2K$, we deduce that
there must be at least one arm $k$ that is over-pulled, i.e., $T_{k,n} - 2 > \lambda_k(n - 2K)$. Note that for this arm,
$T_{k,n} - 2 > \lambda_k(n - 2K) \ge 0$, so we know that this specific arm is pulled at least once after the initialization
phase and that it satisfies Equation (A.8). Using the definition of the optimal (up to rounding effects)
allocation $T^*_{k,n} = n\lambda_k = n\sigma_k^2/\Sigma$ and the fact that $T_{k,n} \ge \lambda_k(n - 2K) + 2$, Equation (A.8) may be written as

$$\frac{\sigma_p^2}{T_{p,n}} \le \frac{1}{\lambda_k(n - 2K) + 1} \left( \sigma_k^2 + 6\sqrt{\frac{\log(1/\delta)}{2(\lambda_k(n - 2K) + 2 - 1)}} \right) \le \frac{\Sigma}{n - 2K} + \frac{12\sqrt{\log(1/\delta)}}{(\lambda_{\min} n)^{3/2}} \le \frac{\Sigma}{n} + \frac{12\sqrt{\log(1/\delta)}}{(\lambda_{\min} n)^{3/2}} + \frac{4K\Sigma}{n^2}, \tag{A.9}$$

since $\lambda_k(n - 2K) + 1 \ge \lambda_k(n/2 - 2K + 2K) + 1 \ge n\lambda_k/2$ as $n \ge 4K$ (thus also $\Sigma/(n - 2K) \le \Sigma/n + 4K\Sigma/n^2$).
Moreover, if no arm is under-pulled after time $2K$, then for each $p$, $T_{p,n} \ge 2 + \lambda_p(n - 2K) \ge \lambda_p(n - 2K)$, i.e.,
$\sigma_p^2/T_{p,n} \le \sigma_p^2/(\lambda_p(n - 2K)) = \Sigma/(n - 2K)$, i.e., Equation (A.9) holds anyway (whether there are
under-pulled arms or not). By reordering the terms in the previous equation, we obtain the lower bound

$$T_{p,n} \ge \frac{\sigma_p^2}{\dfrac{\Sigma}{n} + \dfrac{12\sqrt{\log(1/\delta)}}{(\lambda_{\min} n)^{3/2}} + \dfrac{4K\Sigma}{n^2}} \ge T^*_{p,n} - \frac{12\lambda_p}{\Sigma\lambda_{\min}^{3/2}}\sqrt{n\log(1/\delta)} - 4K\lambda_p, \tag{A.10}$$

where in the second inequality we used $1/(1 + x) \ge 1 - x$ (for $x > -1$). Note that the lower bound (A.10)
holds on $\xi$ for any arm $p$.

Step 3. Upper bound on $T_{p,n}$. Using Equation (A.10) and the fact that $\sum_k T_{k,n} = \sum_k T^*_{k,n} = n$, we
obtain the upper bound

$$T_{p,n} = n - \sum_{k \ne p} T_{k,n} \le T^*_{p,n} + \frac{12}{\Sigma\lambda_{\min}^{3/2}}\sqrt{n\log(1/\delta)} + 4K. \tag{A.11}$$

The claim follows by combining the lower and upper bounds in Equations (A.10) and (A.11). □

Appendix A.3. Regret Bound 

We now show how the bound on the allocation over arms translates into a bound on the regret of the
algorithm as stated in Theorem 1.

Proof of Theorem 1. The proof consists of the following two main steps.

Step 1. For each $1 \le n' \le n$, $T_{k,n'}$ is a stopping time. For a given $k$, let $(\mathcal{F}^{(k)}_t)_{t \le n}$ be the filtration
associated to the process $(X_{k,t})_{t \le n}$, and $\mathcal{E}_{-k} = \mathcal{E}_{-k,n}$ be the $\sigma$-algebra generated by $(X_{k',t'})_{t' \le n,\, k' \ne k}$
("environment"). Let $\mathcal{G}^{(k)}_t = \sigma(\mathcal{F}^{(k)}_t, \mathcal{E}_{-k})$.

We prove for fixed budget $n$, by induction for $n' = 1, \dots, n$, that each $T_{k,n'}$ is a stopping time with respect
to the filtration $(\mathcal{G}^{(k)}_t)_{t \le n}$.

For $n' \le 2K$ (initialization), $T_{k,n'}$ is deterministic, so for any $t$, $\{T_{k,n'} \le t\}$ is either the empty set or the
whole probability space (and is thus measurable according to $\mathcal{G}^{(k)}_t$).

Let us now assume that for a given time step $2K \le n' < n$, and for any $t$, $\{T_{k,n'} \le t\}$ is $\mathcal{G}^{(k)}_t$-measurable.
We consider now time step $n' + 1$. Note first that for $t = 0$, $\{T_{k,n'+1} \le t\} = \{T_{k,n'+1} \le 0\}$ is the empty set
and is thus $\mathcal{G}^{(k)}_t$-measurable. If $t > 0$, then

$$\{T_{k,n'+1} \le t\} = \big(\{T_{k,n'} = t\} \cap \{k_{n'+1} \ne k\}\big) \cup \{T_{k,n'} \le t - 1\}. \tag{A.12}$$

By the induction assumption, $\{T_{k,n'} = t\}$ and $\{T_{k,n'} \le t - 1\}$ are $\mathcal{G}^{(k)}_t$-measurable (since for any $t'$, $\{T_{k,n'} \le t'\}$
is $\mathcal{G}^{(k)}_{t'}$-measurable). On $\{T_{k,n'} = t\}$, $k_{n'+1}$ is also $\mathcal{G}^{(k)}_t$-measurable since it is determined only by the values
of the upper bounds $\{B_{q,n'+1}\}_{1 \le q \le K}$ (which depend only on $(X_{k',t'})_{t' \le n,\, k' \ne k}$ and on $(X_{k,1}, \dots, X_{k,t})$).
Hence, $\{T_{k,n'} = t\} \cap \{k_{n'+1} \ne k\}$ is $\mathcal{G}^{(k)}_t$-measurable, and thus using (A.12), we have that $\{T_{k,n'+1} \le t\}$ is
$\mathcal{G}^{(k)}_t$-measurable as well.

We have thus proved by induction that $T_{k,n'}$ is a stopping time with respect to the filtration $(\mathcal{G}^{(k)}_t)_{t \le n}$.
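Step 2. Regret bound. Using its definition, we may write $L_{k,n}$ as follows:

$$L_{k,n} = \mathbb{E}\big[(\hat\mu_{k,n} - \mu_k)^2\big] = \mathbb{E}\big[(\hat\mu_{k,n} - \mu_k)^2\, \mathbb{I}\{\xi\}\big] + \mathbb{E}\big[(\hat\mu_{k,n} - \mu_k)^2\, \mathbb{I}\{\xi^C\}\big].$$

Using the definition of $\hat\mu_{k,n}$ and Proposition 2 for the filtration $(\mathcal{G}^{(k)}_t)_{t \le n}$, the sequence $(X_{k,t})_{t \le n}$, and $T_{k,n}$ (and that
$\mathcal{G}^{(k)}_t = \sigma(\{X_{k,t'} : t' \le t\} \cup \{X_{k',t'} : t' \le n,\, k' \ne k\})$ and $\sigma(\{X_{k,t'} : t' \ge t + 1\})$ are independent for any $t \le n$), we
bound the first term as

$$\mathbb{E}\big[(\hat\mu_{k,n} - \mu_k)^2\, \mathbb{I}\{\xi\}\big] \le \sup_{\omega \in \xi} \frac{1}{T^2_{k,n}(\omega)}\, \mathbb{E}\Big[\Big(\sum_{t=1}^{T_{k,n}} X_{k,t} - T_{k,n}\mu_k\Big)^2\, \mathbb{I}\{\xi\}\Big] \le \sup_{\omega \in \xi} \frac{1}{T^2_{k,n}(\omega)}\, \mathbb{E}\Big[\Big(\sum_{t=1}^{T_{k,n}} X_{k,t} - T_{k,n}\mu_k\Big)^2\Big] = \sup_{\omega \in \xi} \frac{\sigma_k^2}{T^2_{k,n}(\omega)}\, \mathbb{E}[T_{k,n}]. \tag{A.13}$$

Since the upper bound in Lemma 1 is obtained on the event $\xi$ (and thus with high probability), and as
$T_{k,n} \le n$, we may easily convert it to a bound in expectation as follows:

$$\mathbb{E}[T_{k,n}] \le T^*_{k,n} + \frac{12}{\Sigma\lambda_{\min}^{3/2}}\sqrt{n\log(1/\delta)} + 4K + n \times 4nK\delta. \tag{A.14}$$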


Combining Equation (IA.13|) and IA.141 and using Equation (IA.9I1 for sup^ obtain 

ifi'k,n /^fc) I{C} 


E 


^ fS ^ 12v^log(l/,5) ^ 4KE 


('^k.n + log(l/<5) + 4A: -h n X 4nKS) 


^ n (Aminn)3/2 

By setting A = to simplify the notation, Equation (IA.15|) may be simplified 

-^mii 

E 


(A.15) 


as 


< 


(Afc.n Mfe) 


n n 


A , 4Ke\ fn 


3/2 


S Sal 


A ^ 4K + 4n^KS 
Vn -I - 2 - 


'e 2 A2 IbAT^E^ 2 AE 8KS^ 8AKs\ 


— ~ 4-j + 




2AS 1 






A^ 


7,5/2 


16K^S^ 


n 


7,7/2 


-h 8KS^ + 




](■) 
)(■■). 


8AKS\ 
771/2 ) 


where in the last passage we used n > 5K. Let B = A^ + 12KS^ + 4A'/KS. We further simplify the 
previous expression as 


E 


{fJ‘k,n fJ-k) I{C} 


E 

< — 


1 /SA 


n 775/2 V cr| 


„ ,, 1 /4a:e2 


2A^ 




1 /SSAi^ , AB\ AKB 




+ 


2r,3 


atn 


MKS^ 8S AK 4KB\ 


I 


-f 


aln^l"^ 


aln ) 


S. 
We now choose $\delta = n^{-5/2}$ and, by using $n \ge 5K$ and $\lambda_{\min} \le 1/K$, we obtain


n 


< 


n 

_ E 
n 


1 / YjjA. n a\ ^ /a r^v-^ B , 


<5+ 

n 


1 1 

^EA 

n3/2 \ 


1 , 

^EA 

n3/2 \ 


1 , 

^EA 

n3/2 \ 


1 , 

^EA 

n3/2 ' 



„A 1 2A^ B 4T,Ay/K 

+ 2A) +— -2- + — + ^ + 

J rfi \ at a^ S 


+ 


AB 


B 2E‘^^/K 2T,A , B 


2^/KalY. 
AB 


H- o + 


B 


21:A B \ 

2./Kat) 


2-/Kat 
B \ 


.— + — + 2A2\fl^ + 2A + — 

2 S 2 /^ S 2VKE 


) 1 / 9/4^RRR 

+ T- -(4KE + 2E^/K + 4A^/K + 2A+ — + ^ _ + — 

AminU^ V S S 2 ^/^E 


AB \ 


1 / YjjA a\ 1 / t A r^2 A / A /T7 r\\ 2^ B B B 

^ ^ Ami„n2 i^+ ) + ~ + s 4E3/2 ^ /re 


2 /^E A'S 2T?^/K^ 


b^ab. 


4E5/2 7’ 


where the last passage follows from $\Sigma \le K/4$.

Before proceeding further we notice that $\lambda_{\min} \le 1/K$ and thus


;^3/2 ^ 


1 


A 


< 


A 


A 
< —. 


A^/f„ 12ytoi(T7^ 12v/(5/2)logn 27’ 


where the first passage follows from the definition of A and the second from S = n and n > 5K > 10. 
This implies by definition of B 

B = + 12KT? + 4A^/KT. < + 3^2/27^4 + A^ 121 = 1009^2/972 < 27^2/26 < 1.05212, 

where we use E < K/4. By using the previous bound, we finally obtain since 1 . 47 f 2 < < 0.7212/272 < 

212/1041 


E 


(/ifc,n /2fc) I{C} 


E 1 2 E 2 I 
~ n A A V fT 2 
E 1 /E^ 
~ n~^ V fj? 


E 1 221 1 

<- 1 -57^7":-f 


221 

2A 


„A 1 /, 277 22 I 2 1.05212 1.05212 1.05212 1.052l3\ 

+ A + + ■■‘(4^+2) + — + — + ISTF + -ifir + -stf) 


1 


4E3/2 

1.05y43\ 


Ami 


■min 


n‘ 


{A^mi + 0.9#/» + 3.6(1 + i_).4> + 1|E) 


.(0.m 3/^ +3.7(1+ E)3l» + t|lf). 

Since |/ifc,n — ^fc| is always smaller than 1, we have E[{jlk,n — < 4nKS = 4Kn~^^^. We also know 

that A < i£V^°p^ Thus the expected loss of arm k is bounded by 


r / S ^ 38x/log(n) 

- n ^ n3/2A'/" ' A: • -2 




< 


E 39A/log(n) 2.9 X 103 (logn)3/2 / 1 \ 

n”^ „3/2A^/2 A >11/2 

5SV2 • 


since 


1 < 1 4 - 4 

^ 5 + 


Using the definition of regret = max^ Li-.n ~ ^ve obtain 

39A/log(n) 2.9 X 103 (logn)3/2 / 


RniAcn) < 


n3/2Ai/m 


A 11/2 

^min 


11 1 

E ^ E2 E5/2 


)■ 


(A.16) 

□ 
Appendix B. Regret Bound for the Bernstein Algorithm

Let us consider $n > 0$, $0 < \delta < 1$ (that can be a function of $n$), $c_1 > 0$ and $c_2 > 0$ fixed. We consider all
the quantities in the definition of algorithm B-AS as defined with respect to these fixed $n$, $\delta$, $c_1$, $c_2$,
and use the abbreviated notations $\hat\mu_{k,t}$, $\hat\sigma_{k,t}$, $B_{k,t}$, $k_t$, and $T_{k,t}$.

Appendix B.1. Basic Tools

Before proving the bounds in Theorems 2 and 3 we need a number of technical tools, in particular for
sub-Gaussian random variables.

The upper confidence bounds $B_{k,t}$ used in the B-AS algorithm are motivated by Theorem 10 in [l^. We
extend this result to sub-Gaussian random variables. We first restate that theorem:

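Theorem 4. Let $X_1, \dots, X_t$ be $t \ge 2$ i.i.d. random variables with variance $\sigma^2$ and mean $\mu$ and such
that $X_i \in [0, b]$ for all $i$. Then with probability at least $1 - \delta$, we have

$$\left| \sqrt{\frac{1}{t-1}\sum_{i=1}^{t}\Big(X_i - \frac{1}{t}\sum_{j=1}^{t} X_j\Big)^2} - \sigma \right| \le b\sqrt{\frac{2\log(2/\delta)}{t-1}}.$$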


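We now state and prove the following lemma (first statement of Lemma 2).

Lemma 4. Let Assumption 1 hold, and let $n \ge 2$, $c_1 > 0$, $c_2 > 0$, and $0 < \delta \le \min(1, c_2)$. For the event

$$\xi = \xi_{K,n}(\delta) = \bigcap_{1 \le k \le K,\ 2 \le t \le n} \left\{ \left| \sqrt{\frac{1}{t-1}\sum_{i=1}^{t}\Big(X_{k,i} - \frac{1}{t}\sum_{j=1}^{t} X_{k,j}\Big)^2} - \sigma_k \right| \le 2a\sqrt{\frac{\log(2/\delta)}{t}} \right\}, \tag{B.1}$$

where $a = 2\sqrt{c_1\log(c_2/\delta)} + \dfrac{\sqrt{n\, c_1\delta(1 + c_2 + \log(c_2/\delta))}}{(1 - \delta)\sqrt{2\log(2/\delta)}}$, we have $\Pr(\xi) \ge 1 - 2nK\delta$.

Note that the first term in the absolute value in Equation (B.1) is the empirical standard deviation of arm
$k$ computed as in Equation [5] for $t$ samples. The event $\xi$ plays an important role in the proofs of this section
and a number of statements will be proved on this event.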

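Proof. Step 1. Truncating sub-Gaussian variables. We want to characterize the conditional mean and
variance of the variables $X_{k,t}$ given that $|X_{k,t} - \mu_k| \le \sqrt{c_1\log(c_2/\delta)}$. For any non-negative random variable
$Y$ and any $b \ge 0$, $\mathbb{E}[Y\mathbb{I}\{Y > b\}] = \int_b^\infty \mathbb{P}[Y > \varepsilon]\,d\varepsilon + b\,\mathbb{P}[Y > b]$ (see the note at the end of this step). In order to simplify the notation we
introduce the deviation random variable $s_{k,t} = X_{k,t} - \mu_k$. If we take $b = c_1\log(c_2/\delta)$ and use Assumption
1, we obtain $\mathbb{P}[s^2_{k,t} > b] \le \delta$ and

$$\mathbb{E}\big[s^2_{k,t}\,\mathbb{I}\{s^2_{k,t} > b\}\big] = \int_b^\infty \mathbb{P}[s^2_{k,t} > \varepsilon]\,d\varepsilon + b\,\mathbb{P}[s^2_{k,t} > b] \le \int_b^\infty c_2\exp(-\varepsilon/c_1)\,d\varepsilon + b\,c_2\exp(-b/c_1) = c_1\delta + c_1\delta\log(c_2/\delta) = c_1\delta\big(1 + \log(c_2/\delta)\big).$$

By definition of $s_{k,t}$, we have $\mathbb{E}[s^2_{k,t}\mathbb{I}\{s^2_{k,t} > b\}] + \mathbb{E}[s^2_{k,t}\mathbb{I}\{s^2_{k,t} \le b\}] = \sigma_k^2$, which can be written as

$$\sigma_k^2 - \mathbb{E}\big[s^2_{k,t} \mid s^2_{k,t} \le b\big] = \frac{\mathbb{E}\big[s^2_{k,t}\,\mathbb{I}\{s^2_{k,t} > b\}\big] - \sigma_k^2\,\mathbb{P}[s^2_{k,t} > b]}{\mathbb{P}[s^2_{k,t} \le b]}, \tag{B.2}$$

that, combined with the previous equation, implies that

$$\Big|\mathbb{E}\big[s^2_{k,t} \mid s^2_{k,t} \le b\big] - \sigma_k^2\Big| = \frac{\Big|\mathbb{E}\big[(s^2_{k,t} - \sigma_k^2)\,\mathbb{I}\{s^2_{k,t} > b\}\big]\Big|}{\mathbb{P}[s^2_{k,t} \le b]} \le \frac{c_1\delta(1 + \log(c_2/\delta)) + \delta\sigma_k^2}{1 - \delta}, \tag{B.3}$$

where we use $1 + \log(c_2/\delta) \ge 0$, which follows from $\delta \le c_2$. Note also that the Cauchy-Schwarz inequality implies

$$\Big|\mathbb{E}\big[s_{k,t}\,\mathbb{I}\{s^2_{k,t} > b\}\big]\Big| \le \sqrt{\mathbb{E}\big[s^2_{k,t}\,\mathbb{I}\{s^2_{k,t} > b\}\big]} \le \sqrt{c_1\delta(1 + \log(c_2/\delta))}.$$

We now introduce the conditional mean of $X_{k,t}$ conditioned on small deviations, that is
$\tilde\mu_k = \mathbb{E}[X_{k,t} \mid s^2_{k,t} \le b] = \mathbb{E}[X_{k,t}\mathbb{I}\{s^2_{k,t} \le b\}] / \mathbb{P}[s^2_{k,t} \le b]$. Thus we can combine
$\mathbb{E}[X_{k,t}\mathbb{I}\{s^2_{k,t} > b\}] + \mathbb{E}[X_{k,t}\mathbb{I}\{s^2_{k,t} \le b\}] = \mu_k$ with the previous
result and obtain

$$|\tilde\mu_k - \mu_k| = \frac{\Big|\mathbb{E}\big[s_{k,t}\,\mathbb{I}\{s^2_{k,t} > b\}\big]\Big|}{\mathbb{P}[s^2_{k,t} \le b]} \le \frac{\sqrt{c_1\delta(1 + \log(c_2/\delta))}}{1 - \delta}. \tag{B.4}$$

We also define the variance of the conditional random variable, $\tilde\sigma_k^2 = \mathbb{V}[X_{k,t} \mid s^2_{k,t} \le b] = \mathbb{E}[s^2_{k,t} \mid s^2_{k,t} \le b] - (\mu_k - \tilde\mu_k)^2$.
From Equations (B.3) and (B.4) we derive

$$|\tilde\sigma_k^2 - \sigma_k^2| \le \Big|\mathbb{E}\big[s^2_{k,t} \mid s^2_{k,t} \le b\big] - \sigma_k^2\Big| + (\tilde\mu_k - \mu_k)^2 \le \frac{c_1\delta(1 + \log(c_2/\delta)) + \delta\sigma_k^2}{1 - \delta} + \frac{c_1\delta(1 + \log(c_2/\delta))}{(1 - \delta)^2} \le \frac{2c_1\delta(1 + \log(c_2/\delta)) + \delta\sigma_k^2}{(1 - \delta)^2}.$$

In order to get the final result, we first bound the variance $\sigma_k^2$ as a function of the constants $c_1$ and $c_2$ using
the sub-Gaussian assumption as

$$\sigma_k^2 = \mathbb{E}\big[(X_{k,t} - \mu_k)^2\big] = \int_0^\infty \mathbb{P}\big[(X_{k,t} - \mu_k)^2 > \varepsilon\big]\,d\varepsilon \le \int_0^\infty c_2\exp(-\varepsilon/c_1)\,d\varepsilon = c_1 c_2. \tag{B.5}$$

Finally, using $\sqrt{|x - y|} \ge |\sqrt{x} - \sqrt{y}|$ for $x, y \ge 0$, we obtain

$$|\tilde\sigma_k - \sigma_k| \le \frac{\sqrt{2c_1\delta(1 + c_2 + \log(c_2/\delta))}}{1 - \delta}. \tag{B.6}$$

Note on the identity used at the beginning of this step: let $\tilde Y = Y\mathbb{I}\{Y > b\} + b\,\mathbb{I}\{Y \le b\}$; then
$\mathbb{E}[\tilde Y] = \int_0^b \mathbb{P}[\tilde Y > \varepsilon]\,d\varepsilon + \int_b^\infty \mathbb{P}[\tilde Y > \varepsilon]\,d\varepsilon = b + \int_b^\infty \mathbb{P}[Y > \varepsilon]\,d\varepsilon$. Thus we can write
$\mathbb{E}[Y\mathbb{I}\{Y > b\}] = \mathbb{E}[\tilde Y] - b\,\mathbb{P}[Y \le b] = \int_b^\infty \mathbb{P}[Y > \varepsilon]\,d\varepsilon + b\,\mathbb{P}[Y > b]$.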
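The closed form obtained for the truncated second moment in Step 1 can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: the tail bound $c_2\exp(-\varepsilon/c_1)$ is integrated with a plain midpoint rule on a truncated range, the values of $c_1$, $c_2$, $\delta$ are arbitrary, and the function name `truncated_moment_bound` is hypothetical.

```python
import math


def truncated_moment_bound(c1, c2, delta, steps=200_000, upper_mult=60.0):
    """Compare int_b^inf c2*exp(-e/c1) de + b*c2*exp(-b/c1) with c1*delta*(1 + log(c2/delta)).

    Here b = c1*log(c2/delta). The integral is truncated at b + upper_mult*c1,
    where the integrand is negligible, and evaluated with the midpoint rule.
    """
    b = c1 * math.log(c2 / delta)
    h = upper_mult * c1 / steps
    integral = h * sum(c2 * math.exp(-(b + (i + 0.5) * h) / c1) for i in range(steps))
    boundary = b * c2 * math.exp(-b / c1)
    closed_form = c1 * delta * (1.0 + math.log(c2 / delta))
    return integral + boundary, closed_form


numeric, closed = truncated_moment_bound(c1=2.0, c2=1.5, delta=0.01)
print(numeric, closed)
```

The two values agree up to the discretization error, since $c_2\exp(-b/c_1) = \delta$ for this choice of $b$.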


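Step 2. Application of large deviation inequalities.

Let $\xi_1 = \xi_{1,K,n}(\delta)$ be the event

$$\xi_1 = \bigcap_{1 \le k \le K,\ 1 \le t \le n} \Big\{ |X_{k,t} - \mu_k| \le \sqrt{c_1\log(c_2/\delta)} \Big\}.$$

Under Assumption 1, using a union bound, we have that the probability of this event is at least $1 - nK\delta$.
On $\xi_1$, the $\{X_{k,i}\}_i$, $1 \le k \le K$, $1 \le i \le t$, are $t$ i.i.d. bounded random variables with standard deviation $\tilde\sigma_k$.
Let $\xi_2 = \xi_{2,K,n}(\delta)$ be the event

$$\xi_2 = \bigcap_{1 \le k \le K,\ 2 \le t \le n} \left\{ \left| \sqrt{\frac{1}{t-1}\sum_{i=1}^{t}\Big(X_{k,i} - \frac{1}{t}\sum_{j=1}^{t} X_{k,j}\Big)^2} - \tilde\sigma_k \right| \le 2\sqrt{c_1\log(c_2/\delta)}\sqrt{\frac{2\log(2/\delta)}{t-1}} \right\}.$$

Using Theorem 4 and a union bound, we deduce that $\Pr[\xi_1 \cap \xi_2] \ge 1 - 2nK\delta$. Now, from Equation (B.6),
we have on $\xi_1 \cap \xi_2$, for all $1 \le k \le K$, $2 \le t \le n$:

$$\left| \sqrt{\frac{1}{t-1}\sum_{i=1}^{t}\Big(X_{k,i} - \frac{1}{t}\sum_{j=1}^{t} X_{k,j}\Big)^2} - \sigma_k \right| \le 2\sqrt{c_1\log(c_2/\delta)}\sqrt{\frac{2\log(2/\delta)}{t-1}} + \frac{\sqrt{2c_1\delta(1 + c_2 + \log(c_2/\delta))}}{1 - \delta},$$

from which we deduce Lemma 4 (since $\xi_1 \cap \xi_2 \subseteq \xi$ and $2 \le t \le n$). □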

We transcribe the definition (B.1) of $\xi$ in the last lemma into the following lemma for the case where the numbers of
samples $T_{k,t}$ are random.

Lemma 5. For $t = 2K, \dots, n$, let $T_{k,t}$ be any random variable taking values in $\{2, \dots, n\}$. Let $\hat\sigma^2_{k,t}$ be the
empirical variance of arm $k$ computed from its $T_{k,t}$ samples. Then, on the event $\xi$, we have:

$$|\hat\sigma_{k,t} - \sigma_k| \le 2a\sqrt{\frac{\log(2/\delta)}{T_{k,t}}}, \tag{B.7}$$


where a = 2 y^ci log(c 2 /( 5 ) + 


-^Tfc,tCi(5(l+e2+log(c2/<?)) 

(l-5)^21og(2/5) 


Appendix B.2. Allocation Performance 

In this section, we first provide the proof of Lemma 2, we then derive the regret bound of Theorem 2 in
the general case, and we prove Theorem 3 for Gaussians.

Recall that $n \ge 5K$. This will be useful in the following.

Proof of Lemma 2. Note first that the first part of the claim of the lemma is exactly Lemma 4. The rest of
the proof consists of the following five main steps. Until the end of the proof, we assume that $\xi$ holds.


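Step 1. Lower bound of order $\Omega(\sqrt{n})$. We first recall for any arm $q$ the definition of $B_{q,t+1}$ used in the
B-AS algorithm:

$$B_{q,t+1} = \frac{1}{T_{q,t}} \left( \hat\sigma_{q,t} + 2a\sqrt{\frac{\log(2/\delta)}{T_{q,t}}} \right)^2.$$

Using Lemma 5 it follows that on $\xi$, for any $q$ such that $T_{q,t} \ge 2$,

$$\frac{\sigma_q^2}{T_{q,t}} \le B_{q,t+1} \le \frac{1}{T_{q,t}} \left( \sigma_q + 4a\sqrt{\frac{\log(2/\delta)}{T_{q,t}}} \right)^2. \tag{B.8}$$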
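The B-AS index recalled above is simple to compute from the samples of an arm. The following is a minimal sketch under stated assumptions: the empirical standard deviation uses the $1/(t-1)$ normalization, the constant `a` is passed in as a plain number rather than derived from $c_1$, $c_2$, and $\delta$, and the function name `bas_index` is hypothetical.

```python
import math


def bas_index(samples, a, delta):
    """Assumed B-AS index: (empirical std + 2a*sqrt(log(2/delta)/t))^2 / t, for t >= 2 samples."""
    t = len(samples)
    mean = sum(samples) / t
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / (t - 1))
    return (std + 2.0 * a * math.sqrt(math.log(2.0 / delta) / t)) ** 2 / t


xs = [0.0, 1.0, 0.0, 1.0]
plain = bas_index(xs, a=0.0, delta=0.1)     # no confidence term: empirical variance / t
inflated = bas_index(xs, a=1.0, delta=0.1)  # with the confidence term added to the std
print(plain, inflated)
```

With `a = 0` the index reduces to the empirical variance divided by the number of pulls, and the confidence term can only increase it, which is the direction used for the lower bound in Equation (B.8).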


Let q be the index of an arm such that and t + 1 < n be the last time that it was pulled, 

i.e., Tq^t = — 1 and Tq^t+i = From Equation (IB. 81) and the fact that Tq^n > > 5 (see condition 

on c{6), and also the beginning of this section) and Tq^t > 3, we obtain on f 


^*^Note that such an arm always exists for any possible allocation strategy given the constraint n = Bp^n- 
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Bq,t+1 



+ 4a 


log(2/<?) \ 
Tq,t J 


2 



(B.9) 


where we also used > 4 to bound Tq^t in the parenthesis and the fact that Oq < a/S- Since at time t + 1 
we assumed that arm q has been chosen then for any other arm p, we have 


Bp,t+i < Bq^t+i- (B.IO) 

From the definition of removing all the terms but the last and using the fact that Tp_t < Tp^n, we 

obtain the lower bound 


Combining Equations Ill we obtain 

2log(2/(i) 4:K(^VTj + 3a^log(2/(5)j 


(B.ll) 


Finally, this implies that for any p 


2ayiog(2/5) / 3n 


“ V^ + 3a^\og{2/5) V 4iC 

In order to simplify the notation, in the following we use 


,{S) = 


a\/31og(2/^) 


K{V^ + 3a^log{2/6 


thus obtaining Tp^n > c{6)y/n on the event ^ for any p. 


(B.12) 


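Step 2. Mechanism of the algorithm. Note that as $n \ge 5K$, there is at least one arm $q$ that is pulled
after initialization. Let, for such an arm $q$, $t + 1 > 2K$ be the time when arm $q$ is pulled for the last time,
that is $T_{q,t} = T_{q,n} - 1 \ge 2$. Since at time $t + 1$ this arm $q$ is chosen, then for any other arm $p$, we have

$$B_{p,t+1} \le B_{q,t+1}. \tag{B.13}$$

From Equation (B.8) and $T_{q,t} = T_{q,n} - 1$, we obtain

$$B_{q,t+1} \le \frac{1}{T_{q,t}} \left( \sigma_q + 4a\sqrt{\frac{\log(2/\delta)}{T_{q,t}}} \right)^2 = \frac{1}{T_{q,n} - 1} \left( \sigma_q + 4a\sqrt{\frac{\log(2/\delta)}{T_{q,n} - 1}} \right)^2. \tag{B.14}$$

Furthermore, since $T_{p,t} \le T_{p,n}$ and $T_{p,t} \ge 2$ (as $t \ge 2K$), then

$$B_{p,t+1} \ge \frac{\sigma_p^2}{T_{p,t}} \ge \frac{\sigma_p^2}{T_{p,n}}. \tag{B.15}$$

Combining Equations (B.13)-(B.15) we obtain

$$\frac{\sigma_p^2}{T_{p,n}}\,(T_{q,n} - 1) \le \left( \sigma_q + 4a\sqrt{\frac{\log(2/\delta)}{T_{q,n} - 1}} \right)^2.$$

Summing over all $q$ that are pulled after initialization on both sides, we obtain on $\xi$ for any arm $p$

$$\frac{\sigma_p^2}{T_{p,n}}\,(n - 2K) \le \sum_{q} \left( \sigma_q + 4a\sqrt{\frac{\log(2/\delta)}{T_{q,n} - 1}} \right)^2, \tag{B.16}$$

because the arms that are not pulled after the initialization are only pulled twice (so the sum of $T_{q,n} - 1$ over the
arms $q$ that are pulled after initialization is at least $n - 2K$).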

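Step 3. Intermediate lower bound. It is possible to rewrite Equation (B.16), using the fact that $T_{q,n} \ge 2$
(so that $T_{q,n} - 1 \ge T_{q,n}/2$), as

$$\frac{\sigma_p^2}{T_{p,n}}\,(n - 2K) \le \sum_{q} \left( \sigma_q + 4a\sqrt{\frac{\log(2/\delta)}{T_{q,n} - 1}} \right)^2 \le \sum_{q} \left( \sigma_q + 4a\sqrt{\frac{2\log(2/\delta)}{T_{q,n}}} \right)^2. \tag{B.17}$$

Plugging Equation (B.12) in Equation (B.17), we have on $\xi$ for any arm $p$

$$\frac{\sigma_p^2}{T_{p,n}}\,(n - 2K) \le \sum_{q} \left( \sigma_q + 4a\sqrt{\frac{2\log(2/\delta)}{c(\delta)\sqrt{n}}} \right)^2 \le \left( \sqrt{\Sigma} + 4\sqrt{K}\,a\sqrt{\frac{2\log(2/\delta)}{c(\delta)\sqrt{n}}} \right)^2, \tag{B.18}$$

because for any sequence $(a_k)_{k=1,\dots,K} \ge 0$ and any $b \ge 0$, $\sum_k (a_k + b)^2 \le \big(\sqrt{\sum_k a_k^2} + \sqrt{K}\,b\big)^2$ by
Cauchy-Schwarz.

Building on this bound we shall recover the desired bound.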

Step 4. Final lower bound. We first expand the square in Equation (IB.171) using ^ > 2 as 


7p^{n-2K) < + 8aV21og(2/^)^ ^ 


32fl^ log(2/(i) 
To.n 


We now use the bound in Equation (IB.18|) in the second term of the RHS and the bound in Equation (IB.121) 
to bound Tk^n in the last term, thus obtaining 


K ( 


^{n-2K) < y+ Sa^/21og(2M) —-If- V^ + 4VKaj2 

Tp.n Vn — 2K \ '' 

By using again n > SitT and some algebra, we get 


, log(2/<i) \ 32j^a^log(2/^) 

c{S)^/n j c(S)y/n 


- 2K) < E + A (yf + + 32A-aU,e(2/i) 

Tp^n \/n y c{d)y/n J c{^)^/n 

+ 32Ka^log^/6) 

V n c(S)^ 


< E + : 


= E + 


161t'a\/log(2/^) / I 2a-yiog(2/J) \ 3/2^2 log(2/^^_3/4 


\/n 


c(3) J 


VW) 


(B.19) 


We now invert the bound and obtain the final lower bound on Tp_„ as follows: 


ap{n — 3K) 


16j^aVlog(_2^/^^ 2aVlog(2/J)\ ^ g4y2J^3/2^2 log(2M) ^- 3/4 


al{n-2K) 


^ Tp^n KXp 


Syn ^ 

4 _ 16j^a-yiog(2/(5) /^ 2 a-yiog( 2 / 6) \ ^ iog( 2 /^) j^- 3/4 

E^n c{S) J - ' 

16aVlog(2/ I) / ^ 2 ayiog( 2 /(i) ^ ^ 1/2 ^ ^ 1/4 


c{5) J 




<S) J 




n^^^+2 
Note that the above lower bound holds on $\xi$ for any arm $p$.

Step 5. Upper bound. The upper bound on Tp^„ follows by using = n — "^q,n and the previous 

lower bound, that is 


Tp^n < n 'y ^ '^q,r 




<T + K 

— -^p,n ' 


16a^log(2/(5) 


Vs + + 64v^a^+ 2 


c(5) ) 

<S) ) 


16aVlog(2/3) 2ayiog(2/(i) \ ^ log(2/^ ^ 1/4 


a- n'" +2 

SEW 


□ 


Appendix B.3. Regret Bounds 

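With the allocation performance, we now move to the regret bound showing how the number of pulls
translates into the losses $L_{k,n}$ and the global regret as stated in Theorem 2.
We first state some technical results.

Appendix B.3.1. Bound on the Regret Outside $\xi$

The next lemma provides a bound for the loss whenever the event $\xi$ does not hold.

Lemma 6. Let Assumption 1 hold. If $2nK\delta \le c_2$, then for every arm $k$, we have

$$\mathbb{E}\big[(\hat\mu_{k,n} - \mu_k)^2\,\mathbb{I}\{\xi^C\}\big] \le 2c_1 n^2 K\delta\big(1 + \log(c_2/2nK\delta)\big).$$

Proof. Since the arms have sub-Gaussian distributions, for any $1 \le k \le K$ and $1 \le t \le n$, we have

$$\mathbb{P}\big[(X_{k,t} - \mu_k)^2 \ge \varepsilon\big] \le c_2\exp(-\varepsilon/c_1),$$

and thus, since $c_2 \ge 2nK\delta$, we obtain

$$\mathbb{P}\big[(X_{k,t} - \mu_k)^2 \ge c_1\log(c_2/2nK\delta)\big] \le 2nK\delta.$$

Since $\mathbb{P}[\xi^C] \le 2nK\delta$, the previous equation implies, using $c_2/(2nK\delta) \ge 1$,

$$\mathbb{E}\big[(X_{k,t} - \mu_k)^2\,\mathbb{I}\{\xi^C\}\big] = \int_0^\infty \mathbb{P}\big[(X_{k,t} - \mu_k)^2\,\mathbb{I}\{\xi^C\} > \varepsilon\big]\,d\varepsilon \le \int_{c_1\log(c_2/2nK\delta)}^\infty c_2\exp(-\varepsilon/c_1)\,d\varepsilon + c_1\log(c_2/2nK\delta)\,\mathbb{P}[\xi^C] \le 2c_1 nK\delta\big(1 + \log(c_2/2nK\delta)\big).$$

The claim follows from the fact that
$\mathbb{E}[(\hat\mu_{k,n} - \mu_k)^2\,\mathbb{I}\{\xi^C\}] \le \sum_{t=1}^{n} \mathbb{E}[(X_{k,t} - \mu_k)^2\,\mathbb{I}\{\xi^C\}] \le 2c_1 n^2 K\delta(1 + \log(c_2/2nK\delta))$. □

Note that for $\delta = n^{-7/2}$, $n \ge 5K$, and $c_2 \ge 1$, we have $2nK\delta = 2Kn^{-5/2} \le c_2$.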

















Appendix B.3.2. Other Technical Inequalities 
At first let us write, for the sake of convenience, 


B ^ lQKa^/\og{2/5){^+ 2a7log(2 ^\ C = 64\/2A^/^a^ log(^) 


Upper and lower bound on a. 6 = n with n > 5K > 10 and C 2 > 1 


a = 


2 v' .=.logteA) + 

(l-^)v'21og(2/5) 


< V14ci(c 2 + l)log(n) + + C 2 ) < \/15ci(c2 + 1 ) log(n) 

< 4 v^ci(c2 + 1) log(n). 

We also have by just keeping the first term, since C 2 > 1 


a = 


o rA —7T77Y , \/ci(5(l + C 2 + log(c2/5)) r- 

2v'cilog(«/^)+ >2v/5>v/5. 


Lower bound on c{S) when 6 = n See Lemma [2] for the definition of c{6). Using the fact that the arms 
have sub-Gaussian distribution we showed in Equation (IBAl) that (T^ < C 1 C 2 , then we also have E < KciC 2 . 
If d = , we obtain by using the previous lower bound on a that 


c(S = „-T/ 2 ^ ^ aV31og(2/3) 

+ aV31og(2/(i)) 


VM\ yE73 + aVlog2/5 J 


>^fl _V^IZ!_j 

~ 7573 +Vci log 2/5 y 


> 


1 A yw3 \ 

v/3A\ + 


> 


VV I y/Kc2 + %/3 


by using E < Arc 2 Ci for the last step. 

Upper bound on the loss outside ^ when 5 = n~'^ 1"^. We get from Lemma [5] when d = when C 2 > 1 

and when n > 5K that 


E[(Afc.n - ] < 2cin"iL<5(l + log {^^)) < 2c,Kn-^/^(l + (c 2 + 1) log (^)) 

< 2ciKn~^^‘^[l + ^(c 2 + l)log(n)) < 7ciK{c2 + 1) log(n)n“^/^. 

Upper bound on B for S = See the proof of Theorem [2] for the definition of B (the notation B we 

use in this section is for technical purposes and has nothing to do with the B introduced in the proofs for 
algorithm CH-AS). When 6 = when C 2 > 1 and when n > 5K > 10, 
B = l&Ka^J\og{2|5)

< l%Ka^/l/2\og{2n)[y^ + 2^{V^ + 30^7/2log(2n))) 

< 167^0^/7/2 log(2n)(yE + 2^^ + l 2 ^/Ky'cl{c 2 + l)71og(n) log(2n)) 

< 167^0^/7/2 log(2n) ^377-^/0102 + 45\/^\/c/(c^"+T) log(n)^ 

< 327ir\/l4ci (c 2 + 1) lognlog(2n)^487^-\/ci(c2 + 1) log(n)^ 

< 8 X 10^7i:2ci(c2 + 1) log^(n). 

Upper bound on C for 6 = See the proof of Theorem [5] for the definition of C. When 6 = 

when C 2 > 1 and when n > 5K > 10, 




< 64^7f3/2^3/2(iQg(2/5))3/4_ifi/4(y]^ + 12v'ci(c2 + l)lognv'71ogn)^/2 

< 128^2^^7C'^/'‘(2\/2ci(c 2 + 1) logn)^/^(71ogn)^/'^\/^74:^/‘^(ci(c2 + l)Y/'^^J\ogn 

< 14 X 10 ^ 7 C^ci(c 2 + 1) log^(n). 

We are now ready to prove Theorem [2] 

Proof of Theoreml^ Equation (IB.19I1 becomes using the constants B, C that we introduced 

We also have the upper bound in Lemma [5] which can be rewritten: 

Tp,™ < r;,„ + f + f + 27f. 


(B.20) 


Note that because this upper bound holds on an event of probability bigger than 1 — AnK6 and also because 
Tp^n is bounded by n anyways, we can convert the former upper bound in a bound in expectation: 


E[Tp.„] < T;,„ + f + 21^ + n X ^nK&. 


(B.21) 


We recall that the loss of any arm k is decomposed in two parts as follows: 

Lk,n = E[(/ifc,„ — {^}] + 

By combining the fact that is again a stopping time with Equations IB.201 IB.211 and IA.3I (as done in 
Equation (lA. 131) 1. and since n — 2K > 0, we obtain for the first part of the loss: 
E[(Afc,n - Pfc) I{C}1

1 


-ol[n-2KY 




(Tfc,„ + f + 2^ + ^nKS) 


C 


<- 


■{n-2KY 


B 


C 




{B + Cf\(n 


B 

S E^Afc 


^ C 1/4 27f 47127^(5 N 

S2Afc"' SAfc / 


<- 


1 / B ^ C -\-2KYi 1/4 4n KTjS „n /— 1/4 

„ 7 -^x 9 I ^E] + ——+--- n +---h 2B^yTL + 2Ctl 

2K)-‘ Afc Afc Afc 

2/1 ^ B + C ^ 2K^ 


'-{n- 

2{B + C){§ + % + 2K) 8(B + C)n^/^KS 


Afc 


B 


nYj + (--h 2B)^/n + ( 

Afc 


{n-2KY 

2{B + C){^+2K) 


Afc 

c + 27t:s 

Afc 


+ {B + CY 


S^Afc EAfc/ 


+ 4n7s:5 


(-B + Cf 

EAfc 


+ 2C')n^/^ 


I (p I ^ I ^ ^ I 2-ft' \ 

+ (B + C) (- + ^^ + _j 


4n2 7^E5 8(B + JB + C)^ 

H-;-h - r- -^ 4n7f(5 ^ ' 


Afc 


Afc 


EAfc 


< 


{n-2Ky 

K(B + Cf ( 
Afc 

Afc 


3-B ^ 3C + 27i'E 1/4 

nE + -—H-T- "n 

Afc Afc 


/ 2 4 Afc 1 2 \ 

V7i'E(B + C') (B + C)2 A'E(B + C) E^Tf E(B + C')/ 

(e + 2(B + C)+ + 


and since B + C > 2 for S = n n> 16K/3 > 8, it implies 

E[{p.k,n - Pfc)^lI{C}l 

1 


< 


(n - 2KY 
45n'^K 


V. ^ 

72E 4 —-—^/n + 
Afc 


Afc 


(e + 2(B + C)+ + 


3(7 + 2 B'E 1/4 BIB + (7)V 1 

--- n' 4-' — 

Afc 

{B + Cf 


Afc 


V2E ^ 8E 2E2 E 


<- 


■{n- 
ASrYK 


1 f ^ 3B ^ 3C + 2KE 1/4 B'(B + (7)V 1 13,. 

" +^3r^fe + T^+‘) 

(E + 2(fl + C) + 15±2)i)j 


<- 


Afc 


1 


^ (n - 277)2 
4Sn^K 


Afc 


3B ^ 3(7 4- 277E 1/4 

nE 4- -T—Vn 4-r- n ' 

\ Afc Afc 

(e + 2(B + C)+ + 
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Now note that, as 6 = n and n > 4K 

1 


E[(/ife,„ - Atfc) I {5}] < 


(n - 27^)2 


^ 3B ^ 3C + 2K-B 1/4 + 1 /, 5 + ^x2^ 

“^+1^^+—+'"Sr-^%+« + rf7^(‘ + —) J 


< ~ + 3 

\ r>^ r)'5 


1 , SK \ I ^ , 3B ^ ^ 3C + 2KE 1/4 , ,^(^ + C’)% 1 


nYj + ——^/n + 
Afc 


Afc 


+ Tf- 


E 8ifE 3 /3B ^ 3C + 2A'E 


^ — + -^Vn + 


Afc 


A, + + + (1+^) 


^ f + Jl: + ^ ('-'<«+^>’'‘++ '*«) 

E QS Q 1 1 

< - + -+ SK{B + (7)^(1 + E)(— + 21)- 


n n^/^Xn 


'E2 'nV-^Xn 


again since B + C > 1. 

Finally, combining that with Lemma [S] gives us for the regret: 

RniAs) < ^ 3 / 3 ^ + + 21)(1 + S) + 2cin"A'5(l + log(c2/2nA'5)). 

By taking (5 = and recalling the bounds on B and C in [Appendix B-lO] we obtain: 


R^{Ab) < — + l)7f log(n)n-"/^ 

76400 ci(c 2 + l)A^log(n)^ /log(ri)®7^' 


< 


Aminn3/2 


/ log(n)“K' N 
+ nV4A^i„ )■ 


□ 


Appendix C. Regret Bound for Gaussian Distributions 

Here we report the proof of Lemma 3, which implies that when the distributions of the arms are Gaussian, 
bounding the regret of the B-AS algorithm does not require upper-bounding the number of pulls $T_{k,n}$ (it 
can be bounded using only a lower bound on the number of pulls). 

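Let $\{X_t\}_{t \geq 1}$ be a sequence of i.i.d. random variables drawn from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$. Write 
$\hat m_t = \frac{1}{t}\sum_{i=1}^{t} X_i$ and $\hat s_t^2 = \frac{1}{t-1}\sum_{i=1}^{t} (X_i - \hat m_t)^2$ for the empirical mean and variance of the first $t$ samples. 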

Before proving the lemma, we recall a property of the normal distribution. 

Proposition 3. Let $X_1, \ldots, X_t$ be $t$ i.i.d. Gaussian random variables. Then their empirical mean $\hat m_t = \frac{1}{t}\sum_{i=1}^{t} X_i$ 
and empirical variance $\hat s_t^2 = \frac{1}{t-1}\sum_{i=1}^{t} (X_i - \hat m_t)^2$ are independent of each other. 

Based only on the well-known $t = 2$ case (i.e., that $X_1 + X_2$ and $|X_1 - X_2|$ are independent), we can derive 
a somewhat stronger result that is used in the proof of Lemma 3, showing that for Gaussian distributions, 
the empirical mean $\hat m_t$ built on $t$ i.i.d. samples is independent from the sequence of standard deviations 
$(\hat s_2, \ldots, \hat s_t)$ (not only from $\hat s_t$). 

We first derive a general result showing that for Gaussian distributions, the empirical mean $\hat m_t$ built on 
$t$ i.i.d. samples is independent from the sequence of standard deviations $\hat s_2, \ldots, \hat s_t$. 
Lemma 7. Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the sequence of random variables $\hat s_2, \ldots, \hat s_t$. Then for all 
$t \geq 2$, 

$$\hat m_t \mid \mathcal{F}_t \sim \mathcal{N}\Big(\mu, \frac{\sigma^2}{t}\Big).$$

To prove Lemma 3, we need the following technical lemma: 

Lemma 8. We have 

$$\hat s_{t+1}^2 = \frac{t-1}{t}\,\hat s_t^2 + \frac{1}{t+1}\,(X_{t+1} - \hat m_t)^2.$$

Note that this statement is deterministic: it holds for any process or sequence. 
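Proof. We have for $t \geq 2$ 

$$\begin{aligned}
\hat s_{t+1}^2 &= \frac{1}{t}\sum_{i=1}^{t+1} (X_i - \hat m_{t+1})^2 \\
&= \frac{1}{t}\sum_{i=1}^{t} (X_i - \hat m_t + \hat m_t - \hat m_{t+1})^2 + \frac{1}{t}(X_{t+1} - \hat m_{t+1})^2 \\
&= \frac{1}{t}\sum_{i=1}^{t} (X_i - \hat m_t)^2 + \frac{1}{t}(X_{t+1} - \hat m_{t+1})^2 + (\hat m_t - \hat m_{t+1})^2 \\
&= \frac{t-1}{t}\,\hat s_t^2 + \frac{1}{t}\,\frac{t^2}{(t+1)^2}(X_{t+1} - \hat m_t)^2 + \frac{1}{(t+1)^2}(X_{t+1} - \hat m_t)^2 \\
&= \frac{t-1}{t}\,\hat s_t^2 + \frac{1}{t+1}(X_{t+1} - \hat m_t)^2,
\end{aligned}$$

where the cross term in the third line vanishes because $\sum_{i=1}^{t}(X_i - \hat m_t) = 0$, and the fourth line uses $\hat m_{t+1} - \hat m_t = \frac{1}{t+1}(X_{t+1} - \hat m_t)$, 
which finishes the proof. 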


□ 
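Because Lemma 8 is a deterministic identity, it can be checked numerically on any sequence. The following is a minimal sketch, not part of the original analysis: the function name `variance_by_recursion` and the test data are illustrative assumptions, and the unbiased normalisation $1/(t-1)$ is assumed for $\hat s_t^2$.

```python
import random
import statistics


def variance_by_recursion(samples):
    """Return [s_2^2, ..., s_n^2] using the one-step update of Lemma 8.

    Assumes the unbiased empirical variance (normalisation 1 / (t - 1)).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = (samples[0] + samples[1]) / 2.0
    var = (samples[0] - samples[1]) ** 2 / 2.0  # s_2^2
    out = [var]
    for t in range(2, len(samples)):  # t = number of samples used so far
        x_next = samples[t]  # this is X_{t+1} in the paper's indexing
        var = (t - 1) / t * var + (x_next - mean) ** 2 / (t + 1)
        mean = mean + (x_next - mean) / (t + 1)
        out.append(var)
    return out


rng = random.Random(0)
xs = [rng.gauss(1.0, 2.0) for _ in range(50)]
recursive = variance_by_recursion(xs)
direct = [statistics.variance(xs[:t]) for t in range(2, len(xs) + 1)]
max_gap = max(abs(a - b) for a, b in zip(recursive, direct))
```

The recursion and the direct computation should agree up to floating-point error, whatever distribution generated `xs`.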


From Lemma 8 we deduce by induction that for any $t \geq 2$ there exists a sequence of non-negative real 
numbers $\{a_{1,t}, a_{2,t}, \ldots, a_{t-1,t}\}$ such that 

$$\hat s_t^2 = a_{1,t}\,\hat s_2^2 + \sum_{i=2}^{t-1} a_{i,t}\,(X_{i+1} - \hat m_i)^2.$$
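As an illustration of the decomposition above, the coefficients can be obtained by unrolling the recursion of Lemma 8 and checked in exact rational arithmetic. This is a sketch under assumptions: the helper names are hypothetical, the unbiased normalisation $1/(t-1)$ is assumed, and the closed forms $a_{1,t} = 1/(t-1)$ and $a_{i,t} = i/((i+1)(t-1))$ tested below are derived here from the recursion rather than stated in the paper.

```python
from fractions import Fraction


def coefficients(t):
    """Unroll Lemma 8 to get a_{1,t}, ..., a_{t-1,t} as exact fractions."""
    coeffs = {1: Fraction(1)}  # t = 2: s_2^2 = 1 * s_2^2
    for u in range(2, t):  # step from u samples to u + 1 samples
        coeffs = {i: a * Fraction(u - 1, u) for i, a in coeffs.items()}
        coeffs[u] = Fraction(1, u + 1)  # weight of (X_{u+1} - m_u)^2
    return coeffs


def sample_variance(xs):
    """Unbiased empirical variance of a list of Fractions."""
    m = sum(xs, Fraction(0)) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


xs = [Fraction(v) for v in (3, -1, 4, 1, -5, 9, 2)]
t = len(xs)
a = coefficients(t)
# xs[i] is X_{i+1} (zero-based list), and sum(xs[:i]) / i is the mean m_i.
rebuilt = a[1] * sample_variance(xs[:2]) + sum(
    a[i] * (xs[i] - sum(xs[:i]) / i) ** 2 for i in range(2, t)
)
```

Since all weights are non-negative, knowing $\hat s_2^2$ and the squared increments $(X_{i+1} - \hat m_i)^2$ determines every $\hat s_t^2$, which is the fact the induction below relies on.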


Proof. We prove the statement by induction. 

The base of the induction ($t = 2$) is directly implied by the specific properties of Gaussian distributions 
(Proposition 3). In fact, $\hat m_2$ is distributed as $\mathcal{N}(\mu, \sigma^2/2)$ and $\hat m_2$ and $\hat s_2$ are independent. 

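Now we focus on the inductive step. For any $t \geq 2$, let $\mathcal{G}_t$ be the $\sigma$-algebra generated by the random 
variables $\hat s_2^2$ and $\{(X_{i+1} - \hat m_i)^2\}_{2 \leq i \leq t-1}$. The recursive definition of the empirical variance in Lemma 8 
immediately implies that the knowledge of $\{\hat s_2, \ldots, \hat s_t\}$ is equivalent to the knowledge of $\hat s_2^2$ and 
$\{(X_{i+1} - \hat m_i)^2\}_{2 \leq i \leq t-1}$, and thus $\mathcal{F}_t = \mathcal{G}_t$. We assume (inductive hypothesis) 

$$\hat m_t \mid \mathcal{G}_t \sim \mathcal{N}\Big(\mu, \frac{\sigma^2}{t}\Big), \tag{C.1}$$

and we now show that (C.1) also holds for $t+1$. Let $U = X_{t+1} - \hat m_t$ and $V = \hat m_{t+1} - \mu$. Note that $V$ 
can be written as $V = \frac{t}{t+1}(\hat m_t - \mu) + \frac{1}{t+1}(X_{t+1} - \mu)$. Since samples are i.i.d., $X_{t+1}$ is independent from 
$(X_1, \ldots, X_t)$ and 

$$X_{t+1} \mid \mathcal{G}_t \sim \mathcal{N}(\mu, \sigma^2),$$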
and thus $X_{t+1}$ is also conditionally independent of $\hat m_t$ given $\mathcal{G}_t$. This implies that $X_{t+1}$ and $\hat m_t$ are jointly 
Gaussian given $\mathcal{G}_t$ (two random variables that are Gaussian and independent are jointly Gaussian, see [9] 
or also http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Joint_normality). This 
fact combined with the definition of $U$ and $V$ implies that $U$ and $V$ are conditionally jointly Gaussian 
variables with zero conditional mean given $\mathcal{G}_t$ (they are jointly Gaussian because they can be written as 
two independent linear combinations of the random variables $X_{t+1} - \mu$ and $\hat m_t - \mu$ given $\mathcal{G}_t$, see [9] or also 
http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Affine_transformation). 
Furthermore, we can show that they are also conditionally uncorrelated given $\mathcal{G}_t$ since 

$$\begin{aligned}
\mathbb{E}[UV \mid \mathcal{G}_t] &= \mathbb{E}\Big[(X_{t+1} - \hat m_t)\Big(\frac{1}{t+1}X_{t+1} + \frac{t}{t+1}\hat m_t - \mu\Big) \,\Big|\, \mathcal{G}_t\Big] \\
&= \mathbb{E}\Big[\big((X_{t+1} - \mu) - (\hat m_t - \mu)\big)\Big(\frac{1}{t+1}(X_{t+1} - \mu) + \frac{t}{t+1}(\hat m_t - \mu)\Big) \,\Big|\, \mathcal{G}_t\Big] \\
&= \frac{1}{t+1}\,\sigma^2 - \frac{t}{t+1}\,\frac{\sigma^2}{t} \\
&= 0.
\end{aligned}$$


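As a result, $U$ and $V$ are conditionally independent given $\mathcal{G}_t$ and 

$$(\hat m_{t+1} - \mu) \mid \mathcal{G}_{t+1} = (\hat m_{t+1} - \mu) \mid \{\mathcal{G}_t, (X_{t+1} - \hat m_t)^2\} = (\hat m_{t+1} - \mu) \mid \{\mathcal{G}_t, U^2\} = V \mid \{\mathcal{G}_t, U^2\} = V \mid \mathcal{G}_t.$$

Since the induction assumption is verified, we know that $\mathbb{E}[V \mid \mathcal{G}_t] = 0$ and 
$\mathbb{V}[V \mid \mathcal{G}_t] = \frac{t^2}{(t+1)^2}\,\frac{\sigma^2}{t} + \frac{1}{(t+1)^2}\,\sigma^2 = \frac{\sigma^2}{t+1}$. Finally, we deduce that 

$$\hat m_{t+1} \mid \mathcal{G}_{t+1} \sim \mathcal{N}\Big(\mu, \frac{\sigma^2}{t+1}\Big),$$

which concludes the proof since $\mathcal{G}_{t+1} = \mathcal{F}_{t+1}$. □ 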

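We now study an adaptive algorithm that computes the empirical average $\hat m_t$ and that at each time 
$t$ decides whether to stop collecting samples or not on the basis of the sequence of empirical standard 
deviations $\hat s_2, \ldots, \hat s_t$ observed so far. Let $T \geq 2$ be an integer-valued random variable, which is a stopping 
time with respect to $\mathcal{F}_t$. This means that the decision of whether to stop at any time before $t + 1$ (the 
event $\{T \leq t\}$) only depends on the previous empirical standard deviations $\hat s_2, \ldots, \hat s_t$. From an immediate 
application of Lemma 7 we obtain 

$$\begin{aligned}
\mathbb{E}[(\hat m_T - \mu)^2] &= \sum_{t \geq 2} \mathbb{E}[(\hat m_t - \mu)^2 \mid T = t]\, \mathbb{P}[T = t] \\
&= \sum_{t \geq 2} \mathbb{E}\big[\mathbb{E}[(\hat m_t - \mu)^2 \mid \mathcal{F}_t, T = t] \,\big|\, T = t\big]\, \mathbb{P}[T = t] \\
&= \sum_{t \geq 2} \mathbb{E}\big[\mathbb{E}[(\hat m_t - \mu)^2 \mid \mathcal{F}_t] \,\big|\, T = t\big]\, \mathbb{P}[T = t] = \sum_{t \geq 2} \frac{\sigma^2}{t}\, \mathbb{P}[T = t] = \sigma^2\, \mathbb{E}\Big[\frac{1}{T}\Big].
\end{aligned}$$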
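The identity $\mathbb{E}[(\hat m_T - \mu)^2] = \sigma^2\, \mathbb{E}[1/T]$ can be illustrated by simulation. The sketch below is only a sanity check under assumed settings: the stopping rule (stop as soon as $\hat s_t$ falls below a threshold, or at a cap `t_max`), the parameter values, and the function names are all hypothetical choices, not the CH-AS or B-AS rules.

```python
import math
import random


def run_once(rng, mu, sigma, threshold, t_max):
    """Sample until the empirical std s_t <= threshold (or t == t_max).

    The decision to stop at time t uses only s_t, so T is a stopping time
    with respect to the sigma-algebra generated by (s_2, ..., s_t).
    """
    x1, x2 = rng.gauss(mu, sigma), rng.gauss(mu, sigma)
    t, mean = 2, (x1 + x2) / 2.0
    var = (x1 - x2) ** 2 / 2.0  # s_2^2 (unbiased normalisation)
    while math.sqrt(var) > threshold and t < t_max:
        x = rng.gauss(mu, sigma)
        var = (t - 1) / t * var + (x - mean) ** 2 / (t + 1)  # Lemma 8 update
        mean += (x - mean) / (t + 1)
        t += 1
    return t, mean


def check_identity(trials=20000, mu=0.0, sigma=1.0, threshold=0.8,
                   t_max=20, seed=1):
    """Monte Carlo estimates of E[(m_T - mu)^2] and sigma^2 * E[1 / T]."""
    rng = random.Random(seed)
    sq_err = inv_t = 0.0
    for _ in range(trials):
        t, mean = run_once(rng, mu, sigma, threshold, t_max)
        sq_err += (mean - mu) ** 2
        inv_t += 1.0 / t
    return sq_err / trials, sigma ** 2 * inv_t / trials


lhs, rhs = check_identity()
```

The two estimates should agree up to Monte Carlo noise. A rule that stopped on the basis of $\hat m_t$ itself would not be covered by Lemma 7, and the two sides would in general differ.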


The previous result seamlessly extends to the general multi-armed bandit allocation strategies considered 
in Section [3] and S] 

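Proof of Lemma 3. Let us now consider algorithms CH-AS and B-AS. For any arm $k$, the event $\{T_{k,n} > t\}$ 
depends on the $\sigma$-algebra $\mathcal{F}_{k,t}$ (generated by the sequence of empirical variances of the first $t$ samples of 
arm $k$) and also on the "environment" $\mathcal{E}_{-k}$ (generated by all the samples of other arms). Since the samples 
of arm $k$ are independent from $\mathcal{E}_{-k}$, we deduce that by conditioning on $\mathcal{E}_{-k}$ Lemma 7 still applies and 

$$\mathbb{E}[(\hat\mu_{k,n} - \mu_k)^2] = \mathbb{E}_{\mathcal{E}_{-k}}\big[\mathbb{E}[(\hat\mu_{k,n} - \mu_k)^2 \mid \mathcal{E}_{-k}]\big] = \mathbb{E}_{\mathcal{E}_{-k}}\Big[\sigma_k^2\, \mathbb{E}\Big[\frac{1}{T_{k,n}} \,\Big|\, \mathcal{E}_{-k}\Big]\Big] = \sigma_k^2\, \mathbb{E}\Big[\frac{1}{T_{k,n}}\Big].$$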

□ 
We now report the proof of Theorem 3. 

Proof of Theorem 3. We recall Lemma 3 and decompose the loss using the definition of $\xi_{K,n}(\delta)$ in order 
to obtain 

$$L_{k,n} = \sigma_k^2\, \mathbb{E}\Big[\frac{1}{T_{k,n}}\Big] = \sigma_k^2\, \mathbb{E}\Big[\frac{1}{T_{k,n}}\, \mathbb{I}\{\xi\}\Big] + \sigma_k^2\, \mathbb{E}\Big[\frac{1}{T_{k,n}}\, \mathbb{I}\{\xi^c\}\Big].$$

From the bound in Equation (B.20), we have (since $n > 5K$) 


"‘‘'[it'®] 


Tk,n 


< 


B 


C 


.-2K n^/^{n-2K) rPI^{n-2K) 


E 4JfS 2B 2C 
~ n~^ -n? 

E 4JfE 12 X 10® n2 14 X 10® ^2 

- n + ^ + „3/2 ^ ^=1^2 + l)(logn) + K ci(c2 + l)(logn) 

^E , 12.001 X 10® ,^2 , , ^2 , 14x10®,^2 , , ^2 

- ^ ci(c2 + l)(logn) H- K ci(c2 + l)(logn) 

^E 26.001 X 10® ,^2 / x2 

- ~ -372-Cl(C2 + l)(logn)®. 


(C.2) 


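where we use the bounds on $B$ and $C$ in Appendix B.3. Using the fact that $\delta = n^{-7/2}$ and $T_{k,n} \geq 2$, and 
since the concentration lemma tells us that $\mathbb{P}[\xi^c] \leq 2nK\delta$, we may write 

$$\sigma_k^2\, \mathbb{E}\Big[\frac{1}{T_{k,n}}\, \mathbb{I}\{\xi^c\}\Big] \leq K \sigma_k^2\, n^{-5/2} \leq c_1 c_2 K n^{-5/2}.$$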


Finally, combining Equations (C.2) and (C.3) and recalling the definition of regret, we have 

26.001 X 10® 


RniAs) < 
< 

< 


-ii'^ci(c2 + l)(logn)^ + ciC2A'n 


n3/2 

26.002 X 10® ,^2 , , ^2 

--375- K ci(c2 + l)(logn) 

105xl0®E,^2n ^2 

- Zm - ^ ’ 


(C.3) 


(C.4) 


since ci = 2S and C 2 = 1. 


(C.5) 


□ 